DOCUMENT RESUME 

ED 359 268 TM 020 113 



TITLE           Educational Achievement Standards: NAGB's Approach
                Yields Misleading Interpretations. Report to
                Congressional Requesters.
INSTITUTION     General Accounting Office, Washington, DC. Program
                Evaluation and Methodology Div.
REPORT NO       GAO/PEMD-93-12
PUB DATE        Jun 93
NOTE            120p.
AVAILABLE FROM  U.S. General Accounting Office, P.O. Box 6015,
                Gaithersburg, MD 20884-6015 (first copy free; $2 for
                additional copies; orders for 100 or more to one
                address are discounted 25 percent).
PUB TYPE        Reports - Evaluative/Feasibility (142)
EDRS PRICE      MF01/PC05 Plus Postage.
DESCRIPTORS     Academic Achievement; *Academic Standards;
                Educational Policy; Elementary Secondary Education;
                Evaluation Methods; Evaluators; *Mathematics
                Achievement; *Measurement Techniques; National
                Surveys; Research Problems; *Scores; Student
                Evaluation; Test Interpretation; Test Results; *Test
                Validity
IDENTIFIERS     General Accounting Office; *National Assessment
                Governing Board; National Assessment of Educational
                Progress; *Standard Setting



ABSTRACT 

In September 1991, the National Assessment Governing 
Board (NAGB) announced standards for basic, proficient, and advanced 
achievement in mathematics and reported that few American students 
had reached these standards. Expert reviewers noted technical 
problems with the NAGB approach and questioned its results. In this 
report, the NAGB standard-setting approach and ability to provide 
policy guidance to the National Assessment of Educational Progress 
(NAEP) are examined. The NAEP test-score standards set in 1990 were 
evaluated by examining the adequacy of item-judgment procedures and 
by studying whether the evidence supported NAGB's interpretation of 
the NAEP scores selected for each level. The investigation found that 
the standard-setting approach was procedurally flawed, and that the 
interpretations of the resulting NAEP scores were of doubtful 
validity. The NAGB improved its procedures substantially in 1992, but 
the issue of the validity of interpretation remains. The report 
concludes that the NAGB approach is unsuited for the NAEP. 
Alternative approaches are reviewed, but it is apparent that their 
use will be difficult as the NAEP is currently designed. Specific 
recommendations are given to help implement these alternative 
approaches. Six tables and three figures illustrate the discussion. 
Appendixes include comments from the U.S. Department of Education and 
from the NAGB, a summary description of the NAEP and other 
supplementary materials. A four-part bibliography is provided. 
Contains 44 references. (SLD) 
Executive Summary 



In September 1991, the National Assessment Governing Board (NAGB) 
announced standards for basic, proficient, and advanced achievement in 
mathematics as measured by the National Assessment of Educational 
Progress (NAEP) and reported that few American students had reached 
these standards. This finding resulted from an approach to 
standard-setting that had several novel features. Expert reviewers noted 
technical problems with the approach and questioned its results. NAGB 
acknowledged that its procedures were imperfect but considered the 
results sufficiently sound to publish and the approach sufficiently 
promising to be mandated as the primary basis of all future NAEP reporting. 

The question of how to set standards for educational achievement and 
measure progress toward them is currently of great interest, and NAGB's 
approach may serve as a model for other efforts. In view of the 
controversy surrounding this approach, the chairmen of the House 
Education and Labor Committee and the Subcommittee on Elementary, 
Secondary, and Vocational Education asked GAO to evaluate (1) its 
strengths and weaknesses, (2) its suitability and that of alternative 
approaches for use with NAEP, and (3) NAGB's capability to provide 
technically sound policy guidance to NAEP. 



Background 

Funded by the Department of Education, administered by the National 
Center for Education Statistics (NCES), and implemented by a technical 
contractor, NAEP tests American students in basic subjects every few years 
and estimates student achievement at the national level based on complex 
statistical techniques. NAEP's statutory purposes are to describe 
achievement and to track changes over time. For the past two decades, 
NAEP's results have been reported without reference to any goals or 
standards of how students ought to perform. 

In 1988, the Congress created NAGB, an independent and broadly 
representative governing board, to provide policy guidance for the 
assessment. The 1988 law also made NAGB responsible for identifying 
appropriate achievement goals for each subject and grade tested. In the 
hope of interpreting NAEP results in terms of standards for what students 
should know and be able to do, NAGB mandated a standard-setting 
approach that included (1) defining three levels of achievement in general 
terms, (2) using expert panelists to judge how students at each level 
should do on each item on the NAEP mathematics test, (3) selecting a NAEP 
score to represent the lower border of each level, and (4) interpreting 
performance at these scores in terms of the definitions and of statements 
of what students at each level should be able to do. NAGB applied this 
approach to the 1990 NAEP mathematics test on a trial basis and to 
mathematics, reading, and writing in 1992. 

GAO evaluated the NAEP test-score standards NAGB set in 1990 by examining 
the adequacy of NAGB's item judgment procedures and whether evidence 
supported NAGB's interpretation of the NAEP scores selected for each level. 
GAO also identified alternative standard-setting approaches and analyzed 
them to find which would work with the NAEP test as it is now designed. 
Lastly, GAO reviewed how NAGB made key decisions, especially how it used 
technical advice and information, in the level-setting case and two others. 



GAO found that NAGB's 1990 standard-setting approach was procedurally 
flawed and that the interpretations that NAGB gave to the resulting NAEP 
scores were of doubtful validity. While the scores selected represent 
moderate, strong, and outstanding performance on the test as a whole, GAO 
concluded that they do not necessarily imply that students have achieved 
the item mastery or readiness for future life, work, and study specified in 
NAGB's definitions and descriptions. The difficulties evident in NAGB's 1990 
achievement levels resulted in part from procedural problems but also 
from the effort to set standards of overall performance (how good is good 
enough) that would also represent standards of mastery (what students at 
each level should know and be able to do). NAGB improved its 
standard-setting procedures substantially in 1992, but the critical issue of 
validity of interpretation, an issue in NAGB's approach, remains 
unresolved. GAO therefore concluded that NAGB's approach is unsuited for 
NAEP. 

GAO identified several alternative approaches that could be used to 
establish standards for overall performance on a NAEP test. However, any 
approach that sets standards purporting to measure mastery of particular 
subject content will be difficult to use with NAEP as it is currently designed. 

GAO found that in the case of the achievement levels, NAGB designed and 
implemented its approach without adequate technical information. In two 
other cases, however, NAGB made better use of such information. GAO 
concluded that NAGB's composition, procedures, and relationships with the 
Department of Education are inadequate to ensure that policy guidance to 
NAEP will be technically sound. 
Principal Findings 



Problems in NAGB's 
Approach 



NAGB based its approach on a well-known standard-setting method but 
modified this method in untested ways. GAO found NAGB's 1990 procedure 
unusual in three respects: (1) the achievement levels are intended to 
reflect mastery of different types of materials, not merely differences in 
overall performance; (2) panelists applied their own individual ideas of the 
mathematics skills pertinent to each level rather than using a 
consensus-based standard; and (3) the panelists were not assisted in 
making informed judgments of how students who met the expectations for 
lower levels actually would perform on very difficult items. These 
departures left panelists' individual views, not informed consensus, as 
the basis for NAGB's standards. 



Although there was reason to think that it might be difficult to translate 
item judgment results into a NAEP score at which students show both the 
overall performance and the type of mastery expected for each 
achievement level, NAGB did not compare actual to expected performance 
at the scores it selected, nor did it seek evidence that the interpretations it 
gave to those scores were valid compared to other evidence of student 
performance. Finally, NAGB presented its findings without advising readers 
that their validity had not been established and that problems of reliability 
had been found. 



Alternative Standard-Setting Methods 

GAO found that the NAEP scale can be used to express standards for overall 
performance on grade-level materials. There are several methods for 
setting such standards; each combines judgments about desirable levels of 
performance with data about what students at various NAEP scores are able 
to do. Because of the way NAEP is now designed, the current NAEP scale 
score is not a good way to measure students' knowledge of specific areas 
of school content, especially not advanced material that few students are 
taught. If such measurement is desired, new tests will probably need to be 
developed, or NAEP will need to be redesigned. 



NAGB's Capabilities 

GAO concluded that NAGB's strength lies in its broad representation, not in 
its technical expertise. However, the law assigns NAGB responsibility for 
some functions that are clearly technical and for others that have both 
technical and policy implications. From examining three decisions, GAO 
found that when NAGB recognized an issue as clearly technical, it sought 
and used expert technical advice in policy planning and sometimes in 
implementation. However, NAGB initially considered the setting of 
achievement levels a policy function that it itself could perform with 
minimal technical support and did not appreciate the importance of 
verifying the validity of its score interpretations. NAGB's governance 
structure and procedures neither ensure that technical issues will be 
recognized nor require that technical considerations be addressed early in 
the policy formation process. GAO thus concluded that there is substantial 
continuing risk that NAGB may give NAEP technically unsound policy 
direction. 



Recommendations 

Since the current NAGB approach to setting standards has yielded 
unsupported interpretations of NAEP scores, GAO recommends (1) that NAGB 
withdraw its instructions to NCES to publish 1992 NAEP results primarily in 
terms of levels of achievement, (2) that NAGB and NCES review the 
achievement levels approach, and (3) that they examine alternative 
approaches. 

To strengthen NAGB's capacity to give sound policy direction, GAO 
recommends that NAGB (1) obtain NCES review of proposed policies; 
(2) conform to its own policy of prescribing policy ends, not technical 
details; and (3) nominate for the testing and measurement positions on 
NAGB persons who are trained in the design and analysis of large-scale 
educational tests. GAO also recommends that the Congress clarify what it 
intends NAGB to do with respect to achievement goals and review the 
division of responsibilities between NAGB and NCES, with a view toward 
concentrating NAGB's efforts on the representational functions for which it 
is well designed. 



Agency Comments 

GAO requested and received comments from both NAGB and the Department 
of Education (for NCES). The department generally concurred with GAO's 
findings and recommendations. NAGB generally took issue with GAO's 
analysis and conclusions and with some of the recommendations as well. 
The full text of these comments and GAO's responses to them are in 
appendixes I and II. 
Introduction 



In 1990, the National Assessment Governing Board (NAGB) undertook a 
pioneering effort to set standards for student performance on the federally 
sponsored National Assessment of Educational Progress (NAEP). NAEP 
periodically tests national samples of students in grades 4, 8, and 12, 
describes their achievement in basic academic subjects, and analyzes how 
achievement has changed over time. Legislation enacted in 1988 directed 
NAGB to identify achievement goals for each grade and subject tested. NAGB 
responded by designing a standard-setting approach that defined three 
levels of achievement (basic, proficient, and advanced), identified a test 
score to serve as a performance standard for each level, and described 
what students who meet each standard should know and be able to do. 

NAGB applied its approach to the 1990 NAEP assessment in mathematics, a 
subject area that has recently been of great concern and one that received 
special attention in the national education goals adopted by the president 
and the nation's governors in 1989. Using its performance standards to 
analyze 1990 NAEP test results, NAGB found that over one third of the 
students in each grade did not reach even the basic level of achievement in 
mathematics, which connotes partial mastery of fundamental skills. 
According to NAGB's analysis, just under half of the students tested in each 
grade had reached the basic level but could not be considered 
proficient (that is, they had not mastered challenging material) for their 
grade. Between 15 and 19 percent had scored at the proficient level or higher, 
and only 1 to 3 percent scored high enough to be considered advanced. 1 

These findings received wide attention. The National Education Goals 
Panel, a bipartisan association of governors, senior national 
administration officials, and congressional representatives established to 
monitor and report progress toward the national educational goals, 
incorporated NAGB's findings into its first report. Using the performance 
standard for the proficient level as its benchmark, the panel interpreted 
NAGB's findings to mean that less than one student in five had attained the 
national goal of demonstrating competency in mathematics. 2 

Both NAGB's approach and the findings that flowed from it have been 
controversial. Some users of NAEP data have applauded NAGB for providing 
standards where there had been none. Experts in testing and 
measurement, however, have noted possible flaws in NAGB's approach that 



1 National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 1, National and 
State Summaries (Washington, D.C.: 1991), p. 34. 

2 National Education Goals Panel, The National Education Goals Report: Building a Nation of Learners, 
1991 (Washington, D.C.: 1991), p. l£ 
could lead to the selection of NAEP scores that do not accurately represent 
the three achievement levels and have recommended that the approach be 
reexamined. 

The chairmen of the House Committee on Education and Labor and the 
Subcommittee on Elementary, Secondary, and Vocational Education asked 
us to conduct an independent evaluation of NAGB's approach to setting 
standards through NAEP. As the request letter noted, the question of how to 
set standards for educational achievement and measure progress toward 
them is currently of great interest, and NAGB's approach may serve as a 
model for other efforts. It is therefore important that both its strengths and 
possible weaknesses be identified and made public. Moreover, NAEP will 
undoubtedly play a central role in a system of assessments aligned to 
national content standards such as was recommended by the 
congressionally mandated National Council on Education Standards and 
Testing (NCEST). As the NCEST report observes, it is critical to ensure that 
assessments support valid, reliable, and fair measurement of the standards 
and that students are protected against unintended negative consequences 
of assessment approaches that are still being refined. 3 



NAEP has measured student achievement in core academic subjects every 
few years since 1969. Initially funded by grants, the assessment was 
authorized by statute in 1978 and was added to the responsibilities of the 
National Center for Education Statistics (NCES), within what is now the 
Department of Education. NCES carries out the assessment with the 
assistance of a technical contractor, currently the Educational Testing 
Service (ETS). Until 1990, NAEP tested a nationally representative sample of 
students and reported only national and regional results. NAEP assessments 
are designed by broad-based consensus groups and emphasize material 
that is commonly taught for each grade and subject tested. (Summary 
information about NAEP can be found in appendix III.) Traditionally, NAEP 
reports have simply described what students can do; they have not 
prescribed norms or standards of what students should be able to do. 

In 1988, amendments to NAEP's authorizing statute expanded the 
assessment to include state-level testing and established a new governance 
structure (shown in appendix IV). Under this new structure, the 



3 National Council on Education Standards and Testing, Raising Standards for American Education 
(Washington, D.C.: 1992). NAEP does not report results for, and thus cannot have consequences for, 
individual students. However, NAEP data may contribute to decisions about educational policy. If 
NAEP cannot adequately measure performance against educational standards or if it does so 
inaccurately, these decisions could be misguided. 
Commissioner of Education Statistics, who heads NCES, retains 
responsibility for NAEP operations and for technical quality control. NAGB, a 
governing board appointed by the Secretary of Education but independent 
of the department, provides policy guidance for NAEP. NAGB's composition 
is set by statute. Its 23 members include governors and other state 
officials, district officials, teachers, principals, noneducators, and two 
testing and measurement experts. 

In addition to providing policy guidance, NAGB is responsible for specific 
functions formerly performed by a panel that advised the NAEP contractor, 
including selecting the subject areas to be addressed, ensuring that each 
assessment's content is planned through a national consensus, and 
developing guidelines for reporting. The 1988 amendments also gave NAGB 
a special responsibility, that of "identifying appropriate achievement 
goals" for each subject and grade tested. 



The legislative record provided no guidance to assist NAGB in interpreting 
congressional intent with respect to the responsibility to identify 
achievement goals. Educational practice offered little guidance, either: the 
idea of setting goals or standards for student performance on a 
broad-based assessment like NAEP (as opposed to passing scores on tests 
of individual achievement) was relatively new. There was no established 
meaning for the term "achievement goal." Indeed, achievement goals or 
standards in education might be interpreted to mean 

• content standards that identify what students should know, 

• performance standards that identify the levels of performance that 
students should achieve, and 

• performance targets that identify the percentage of students who should 
meet a performance standard. 

In developing its approach, then, NAGB had to decide what kind of goal it 
wanted to establish and then consider how this might best be done. 

Development of NAGB's Approach 

NAGB's initial discussions, early in 1989, focused on how future NAEP tests 
might be designed to reflect content standards as well as current 
educational practice. As events proceeded, NAGB's attention shifted to the 
question of how content or performance standards might be established 
by means of tests that had already been designed or administered. 4 



4 Tests to be administered in 1990 had already been designed. The reading and writing tests were to be 
redesigned for 1992, but the mathematics test design (new in 1990) was scheduled to be used through 
1994. 



NAGB's Approach to 
Identifying Achievement 
Goals 
Standard-setting based on existing tests, if feasible, would enable NAEP to 
measure progress toward the national education goals as early as 1990. 

NAGB conducted a preliminary review of standard-setting methods and 
proposed in December 1989 that item judgment procedures might be used 
to set a single performance standard that represented adequate mastery of 
"core" content for each grade tested in the 1990 assessment of 
mathematics. 5 The proposed method would identify what students should 
know; that is, it would set a content standard. In addition, it would locate 
the test score that represents qualified performance with respect to that 
content, which is a performance standard. NAGB sought public and expert 
comment on the concept paper outlining this plan. 

After reviewing the comments, NAGB concluded that an item judgment 
procedure known as the Angoff method could be applied to NAEP but that 
three performance levels or standards per grade would be needed: a 
challenging standard of proficiency; a lower, basic standard to direct 
attention toward students with the greatest need for improvement; and a 
standard for "world class" advanced performance. In May 1990, NAGB 
adopted a policy that defined three levels of achievement and specified 
that modified Angoff item judgment procedures be used to set 
performance standards for each level and grade (that is, to find the 
threshold score on the 500-point NAEP scale at which the criteria for each 
achievement level were met), beginning with the 1990 mathematics 
assessment on a trial basis. 



Key Steps in the Approach 

The definitions of basic, proficient, and advanced achievement, which 
reflect the language of the national educational goals and are intended to 
be applicable to every subject and grade tested by NAEP, form the 
foundation of NAGB's approach. NAGB's definitions are 



"Basic. This level, below proficient, denotes partial mastery of knowledge and skills that 
are fundamental for proficient work at each grade level: 4, 8, and 12. For 12th grade, this is 
higher than minimum competency skills (which normally are taught in elementary and 
junior high schools) and covers significant elements of standard high-school-level work. 



"Proficient. This central level represents solid academic performance for each grade 
tested: 4, 8, and 12. It reflects a consensus that students reaching this level have 
demonstrated competency over challenging subject matter and are well prepared for 
the next level of schooling. At grade 12, the proficient level encompasses a body of 



5 In an item judgment procedure, judges estimate how students who have the capabilities needed to 
meet a given standard would perform on each item on the test. The judgments are cumulated across 
items to form a total score. NAGB's item judgment procedure is described below. 






subject-matter knowledge and analytical skills, of cultural literacy and insight, that all high 
school graduates should have for democratic citizenship, responsible adulthood, and 
productive work. 

"Advanced. This higher level signifies superior performance beyond proficient grade-level 
mastery at grades 4, 8, and 12. For 12th grade, the advanced level shows readiness for 
rigorous college courses, advanced technical training, or employment requiring advanced 
academic achievement. As data become available, it may be based in part on international 
comparisons of academic achievement and may also be related to Advanced Placement 
and other college placement exams." 6 

Table 1.1 shows the steps through which NAGB translated these definitions 
into NAEP scores and interpretations of those scores, with illustrations 
drawn from NAGB's procedures with respect to the 1990 8th grade basic 
level. 7 In step 2, item judgments, NAGB convened panels of teachers and 
other experts and asked them to judge how students who just reached the 
basic, proficient, or advanced level should perform on each item on the 
4th grade, 8th grade, or 12th grade test. Next (step 3), NAGB combined 
panelists' judgments to calculate the percentage of items on each test that 
should be answered correctly by students at the lower margin of each 
achievement level. 8 



6 National Assessment Governing Board, p. 5. 

7 Our purpose here is simply to describe. We examine the logic of NAGB's approach and the validity of 
its results in chapter 2. 

8 To illustrate, suppose that there are five questions on the test. Panelist A expects that 55, 80, 65, 20, 
and 70 percent of marginally basic students should give a correct answer to questions 1, 2, 3, 4, and 5, 
respectively. The average of these five estimates is 58. Fifty-eight percent correct is panelist A's 
estimate of the percent correct performance standard that a marginally basic student should achieve 
on this five-item test. Suppose that panelist B estimated a 46-percent standard and panelist C estimated 
53 percent. The average for all three panelists, 52, would be the percent correct performance standard 
for basic achievement. 
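
The arithmetic in the illustration above can be sketched in a few lines of Python. This is only a minimal sketch of the averaging step as the footnote describes it; the function names are hypothetical, and it does not model the later step in which each panelist adjusted his or her average to a "best guess."

```python
from statistics import mean

def panelist_standard(item_judgments):
    """Average one panelist's item-level estimates (percent of marginally
    basic students expected to answer each item correctly)."""
    return mean(item_judgments)

def panel_standard(panelist_standards):
    """Average the per-panelist standards into a single percent correct
    performance standard for the level."""
    return mean(panelist_standards)

# Panelist A's judgments for the five items, from the illustration.
panelist_a = panelist_standard([55, 80, 65, 20, 70])

# Panelists B and C are given only as finished averages (46 and 53).
basic_standard = panel_standard([panelist_a, 46, 53])

print(panelist_a)             # 58
print(round(basic_standard))  # 52
```

The unrounded panel average is 157/3, or about 52.3; the illustration reports it as 52.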






Table 1.1: The NAGB Approach: An Illustration 

Step                           8th grade basic level 

1. Level definition            "This level, below proficient, denotes partial mastery 
                               of knowledge and skills that are fundamental for 
                               proficient work at each grade level" 

2. Item judgments              Drawing on the level definition, panelists examined 
                               each item on the 8th grade test and judged how many 
                               marginally basic students out of 100 should be 
                               expected to answer that item correctly 

3. Percent correct             Each panelist's item judgments were averaged; each 
   performance standard        panelist adjusted his or her average to represent a 
                               "best guess" of what the standard should be; the 
                               average best guess across panelists was computed, 
                               resulting in a performance standard of 48 percent 
                               correct 

4. NAEP score performance      The NAEP score of 255, which corresponds to 48-percent 
   standard                    correct, was selected as the lower threshold of the 
                               basic level 

5. Percentage of students      1990 NAEP results showed scores of 255 or above for 
   who met the NAEP standard   62.1 percent of 8th graders 

6. Illustrative items:         Items that judges expected 80 percent of marginally 
   expected performance        basic students to answer correctly were classed as 
                               "basic" items 

7. Illustrative items:         Items actually answered correctly by 80 percent of the 
   actual 1990 performance     students who scored at or near the basic standard of 
                               255 (that is, between 242.5 and 267.5) on the 1990 
                               assessment were classed as "basic" items 

8. Achievement level           "BASIC: Partial Mastery of Knowledge and Skills. The 
   description (statement of   eighth-grade student performing at the basic level 
   expected mastery)           should be able to identify and use the correct 
   summarizing items common    operations for solving one- and two-step problems 
   to steps 6 and 7            involving addition, subtraction, multiplication, and 
                               division of whole numbers and decimals. . . " 

Source: National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 3, 
Technical Report (Washington, D.C.: 1991), pp. 13-14, 32-33, 58-61, 68, and 334. 


NAGB then asked the NAEP technical contractor, ETS, to translate the percent 
correct standard into an equivalent score on the 500-point NAEP proficiency 
scale for the 1990 mathematics test (step 4) and to calculate the 
proportion of U.S. students who met or exceeded each such score based 
on the 1990 test data (step 5). 9 NAGB's results to this point are shown in 
table 1.2. The percent correct standards for the 4th, 8th, and 12th grades 
were very similar to one another, as were the percentages of students who 
scored at or above the basic, proficient, and advanced level. Just over 
60 percent of the students in each grade were found to have reached the 


9 The NAEP scale covers the full range of proficiency tested, from the least-proficient score possible for 
4th graders to the most-proficient score possible for 12th graders. 
basic level. (The 60 percent includes students who reached the two higher 
levels.) Fifteen to 18 percent reached at least the proficient level, of which 
a handful achieved sufficiently high scores to be classified as advanced. 
Scores for nearly 40 percent of the students in each grade fell below the 
standard for the basic level. 10 



Table 1.2: Summary of NAGB Results for 1990 a 

Grade and            Percent correct   NAEP score     Percentage of students 
achievement level    performance       performance    who scored at or above 
                     standard          standard       the NAEP standard 

4th grade 
  Basic              45%               207            63.3% 
  Proficient         68                245            14.9 
  Advanced           87                283             0.6 

8th grade 
  Basic              48                255            62.1 
  Proficient         72                295            18.1 
  Advanced           89                336             1.0 

12th grade 
  Basic              47                282            64.4 
  Proficient         73                330            16.2 
  Advanced           88                358             2.6 



a The achievement level results for 1990 (shown here) were revised in 1992. The revised figures 
apply the standards adopted through the 1992 standard-setting process to the 1990 test, taking 
into account differences in test composition for the two years. The revised figures are included in 
U.S. Department of Education, Office of Educational Research and Improvement, NAEP 1992 
Mathematics Report Card for the Nation and the States (Washington, D.C.: April 1993). 

Source: National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 1, 
National and State Summaries (Washington, D.C.: 1991). 
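
Because the percentages in table 1.2 are cumulative (each counts students at or above a standard), the share of students in each band, including the "nearly 40 percent" below basic, follows by subtraction. The short Python sketch below illustrates that arithmetic using the 1990 figures from the table; the dictionary layout and function name are assumptions made for illustration and are not part of NAGB's or ETS's procedures.

```python
# Cumulative percentages of students at or above each NAEP standard,
# copied from table 1.2 (1990 results, before the 1992 revision).
AT_OR_ABOVE = {
    "4th grade": {"basic": 63.3, "proficient": 14.9, "advanced": 0.6},
    "8th grade": {"basic": 62.1, "proficient": 18.1, "advanced": 1.0},
    "12th grade": {"basic": 64.4, "proficient": 16.2, "advanced": 2.6},
}

def exclusive_bands(cumulative):
    """Convert 'at or above' percentages into non-overlapping bands."""
    return {
        "below basic": round(100.0 - cumulative["basic"], 1),
        "basic only": round(cumulative["basic"] - cumulative["proficient"], 1),
        "proficient only": round(
            cumulative["proficient"] - cumulative["advanced"], 1
        ),
        "advanced": round(cumulative["advanced"], 1),
    }

for grade, cumulative in AT_OR_ABOVE.items():
    print(grade, exclusive_bands(cumulative))
```

For the 8th grade, for example, this gives 37.9 percent below basic and 44.0 percent at the basic level but short of proficient.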

To help users of NAEP data interpret performance at each score standard, 
NAGB identified items that the item judgment panelists expected most 
students at each level to answer correctly (step 6) and items that most 
students who scored at each NAEP performance standard actually did 
answer correctly (step 7). Next (step 8), NAGB created summary 
paragraphs based on items that students should and did answer. Termed 
"achievement levels descriptions," these paragraphs illustrate what 
students at each level should know and be able to do on the NAEP test. 
NAGB's 1990 achievement level descriptions for mathematics are printed in 
appendix V. 



10 NAGB labeled these scores "below basic" but did not define or interpret "below basic" achievement. 
NAGB's approach is based on an item judgment method first proposed by 
William Angoff, a specialist in testing at ETS. The Angoff method has been 
widely used to set standards for individual performance on tests and is 
generally thought to be the most practical of the item judgment methods. 
Typically, the procedure is used to set a single standard or passing score 
that can be used to distinguish individuals who are qualified for some 
purpose (such as course credit, licensure, certification, or entry into a 
program of study) from those who are not qualified, on a test designed for 
that purpose. NAGB's approach modified standard Angoff procedure to set 
standards of how students at three achievement levels should 
perform, not on a test that was designed to measure individual 
performance at these levels but, rather, on one designed to describe 
proficiency for the overall student population. Thus, it employed a 
well-established approach but applied it for a new purpose and in a new 
way. 



Implementation of the 
NAGB Approach 



Early in the summer of 1990, NAGB retained an expert in standard-setting to
assist its staff in conducting the item judgment procedure and analyzing its
results. NAGB arranged for a panel of teachers, university experts, business
leaders, and citizens to meet to perform item judgments in the fall. NAGB
also retained a team of experts in testing, measurement, and evaluation to
conduct a technical and policy evaluation of its procedures.

The item judgment task proved more difficult than expected. The results 
from the initial panel meeting and from a follow-up meeting later in the fall 
were inconsistent and were set aside early in 1991 on the advice of NAGB's
evaluation team and other technical experts. These experts noted
problems in NAGB's implementation of Angoff procedures and commented
that the NAEP mathematics test included few items that represented
"advanced" content and, thus, provided a weak basis for measuring 
advanced achievement. 

NAGB formed new panels to apply its approach with modified procedures
during the spring of 1991. The new panels produced more consistent
results than before. NAGB's technical consultant reported that the
achievement levels appeared technically defensible. However, NAGB's
evaluation team noted that questions concerning NAEP's ability to measure
advanced achievement remained, that there were still problems with
NAGB's procedures and with the quality of the resulting data, and that the
validity of NAGB's standards had not yet been examined. They
recommended that the results not be presented as standards and that the
approach not be used further until it had been thoroughly reviewed. 

NAGB judged the results sufficiently sound to be usable. It arranged to
publish the results under its own authority in September and made them
available to the National Education Goals Panel, which reported at the
same time. Because NAGB's results were not issued as a NAEP report, they
did not undergo NCES technical quality review.

NAGB also designed a request for proposals to apply the approach to the
assessments of mathematics, reading, and writing for 1992. The 1992
contract, for $1.3 million, was awarded to an experienced testing firm,
American College Testing (ACT). ACT has applied the NAGB approach with
procedural changes that are discussed in chapter 2.

In their final report of August 1991, members of NAGB's evaluation team
reiterated their concerns about the technical quality of NAGB's approach
and its results. In its policy evaluation, the team observed that NAGB, whose
members for the most part are not technical experts, may not be 
appropriately constituted to direct a testing program that must meet high 
standards of technical quality. 



Objectives, Scope, 
and Methodology 



In the fall of 1991, the chairmen of the House Committee on Education and 
Labor and the Subcommittee on Elementary, Secondary, and Vocational 
Education asked us to review NAGB's controversial exercise of its
achievement goals responsibility. They asked us to answer three
questions:

1. What are the strengths and weaknesses of NAGB's standard-setting
approach?

2. Is NAGB's approach suited for use with NAEP, and might alternative
approaches provide better benchmarks for goal achievement?

3. Are NAGB's knowledge resources and procedures sufficient to ensure
that work done at its direction and the products that result are technically 
sound and that published measures meet appropriate standards? 



Evaluating NAGB's 
Approach 



To give ourselves a basis for answering the first question, we traced the 
legislative background of NAGB's achievement goals responsibility and
reviewed documents that recorded the development of NAGB's
standard-setting approach. We examined the minutes of NAGB meetings
from January 1989 to November 1991 and reviewed background papers,
records of committee meetings, transcripts of public hearings, written
testimony, technical memoranda and reports, correspondence, and NAGB
and NAEP publications. We reviewed documents that described the NAEP
mathematics assessment for 1990 and the procedures through which NAEP
scale scores are estimated. In addition, we interviewed officials at NAGB,
NCES, and ETS; NAGB's technical consultant and members of NAGB's
evaluation team; and members of another evaluation team that reviewed
the NAGB levels in connection with the Trial State Assessment (TSA). 11
Finally, we spoke with individuals who provided NAGB with written
comments on the approach or who spoke at public hearings on the 
subject. 

We drew our evaluation criteria from the standards issued by professional 
associations concerned with educational tests and measurement. 12 To 
understand the criteria that are likely to apply in a system of assessments 
linked to national content standards, we attended NCEST meetings and
reviewed NCEST background papers as well as the final report. In addition,
we examined the technical literature on methods of standard setting and
on the application of the Angoff item judgment method. Our analysis
compared NAGB's procedures against these general and specific criteria,
identified novel or unconventional aspects of NAGB's approach and
estimated their likely effects on test score selection and interpretation,
used data from NAGB's technical reports to test for these effects, and drew
conclusions regarding the technical soundness of NAGB's procedures. In
addition, we checked the reasonableness of NAGB's 1990 results by locating
other indicators of student achievement and comparing them to NAGB's
findings. 

We reviewed plans for the application of NAGB's approach to the 1992
assessments and preliminary and final results from that process for the 
mathematics assessment. (Work on the reading and writing standards was 
just beginning when we finished data collection for the draft of this report 
in June 1992.) 



11 The TSA is the state-level NAEP assessment of 8th grade mathematics undertaken on a trial basis in
1990. NAGB reported state results as well as national results in terms of the achievement levels.
Accordingly, the NAGB levels were examined as part of the congressionally mandated TSA evaluation.

12 American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education, Standards for Educational and Psychological Testing
(Washington, D.C.: 1985). Similar criteria are reflected in the Code of Fair Testing Practices in
Education issued in 1988 by the Joint Committee on Testing Practices.



Evaluating
Standard-Setting Methods
Appropriate to NAEP



To answer the second question, we first drew conclusions regarding the
appropriateness of NAGB's approach based on the analysis described
above. We identified assessments of various kinds that measure student
performance and apply standards (with an emphasis on national
assessments) and interviewed experts involved with these possible
alternative approaches. To evaluate the strengths and weaknesses of each
approach for use with NAEP, we drew on our earlier work assessing the
quality and use of NAEP data. 13 We also drew on our understanding of NAEP
test design and scaling procedures and general requirements for
technically sound description and trend reporting. We discussed the
technical feasibility of possible alternatives with NCES technical staff.



Evaluating NAGB's
Technical Capacity



To answer the third question, on whether NAGB's resources are adequate to
support technically sound decisions, we examined NAGB's use of technical
information in connection with the standard-setting project, drawing on
the records and interviews described above. We asked officials of NAGB
and NCES to identify additional cases that illustrate NAGB's handling of
technical matters and reviewed records and interviewed participants in
two such cases. To identify procedures that govern NAGB's decisions, we
examined NAGB's written policies and a memorandum of understanding
between NAGB and the Department of Education.



Study Limitations and
Strengths



This study is based on published data that summarize the judgments made
by participants in NAGB's standard-setting process in the spring of 1991 and
focuses primarily on the 8th grade, the only grade for which detailed data
on actual student performance were available. We checked the pattern of
item judgments for the 4th grade also and found that it was consistent with
our analysis. We did not check the 12th grade item judgments.

Our analysis of how students at different NAEP score levels perform on
items of varying degrees of difficulty represents a new way of displaying
what students can do and comparing it to what they should be able to do.
Because this approach is new, its results should be considered suggestive
rather than definitive. 14



13 U.S. General Accounting Office, Education Information: Changes in Funds and Priorities Have
Affected Production and Quality, GAO/PEMD-88-4 (Washington, D.C.: November 1987).

14 NCES commented that our approach leads to conclusions regarding actual versus expected
performance that appear puzzling in light of other kinds of evidence. It has requested its technical
review panel to conduct studies on the topic of comparisons.



A full examination of the strengths and weaknesses of alternative methods
for setting goals or standards through NAEP tests was beyond the scope of
this study. Our study is useful as a methodological critique of NAGB's
approach, but because we could not be exhaustive in our research, we 
also cannot present definitive prescriptions without further research, 
analysis, and comparison. 

The chief strength of this study is that we examined each step of NAGB's
approach in relation to the others. Other evaluations have addressed 
specific aspects of the approach in greater depth than this study can offer, 
but none has examined the entire process and assessed its overall 
consistency. The results of our effort clearly show the importance of 
conducting a full step-by-step analysis prior to adopting a standard-setting 
procedure or accepting its results.



Agency Comments



Responsible officials from the Department of Education and from NAGB
commented orally on a draft of the interim report that we completed in
March 1992. 15 NCES officials generally concurred on the draft report. In oral
comments and in correspondence after the report was issued, however,
both the chairman of NAGB and the Assistant Secretary for Educational
Research and Improvement argued that they believed the levels were
appropriate and useful, that we had applied overly narrow technical
criteria to what is essentially a judgmental process, and that we did not
sufficiently credit improvements in procedures that NAGB implemented for
1992. Since NAGB selected its NAEP score standards on the basis of technical
procedures, and since these scores were intended to be valid measures of
the achievement levels that NAGB had defined, we consider technical
criteria of evaluation appropriate. We take improvements in NAGB's 1992
procedures into account in chapter 2.

Officials from the department (for NCES) and from NAGB reviewed and
commented on a draft of this report as well. The department's comments
and our response to them are included in appendix I, and those from NAGB
appear with our response in appendix II. 



Organization of the
Report



Chapter 2 of this report answers the first of the study questions by
evaluating NAGB's approach as applied to the 1990 NAEP assessment and as
revised for use with the 1992 assessment. Chapter 3, responding to the
second question, examines how NAGB's approach, modifications of that
approach, and alternative methods might be used to set performance
standards for use with NAEP. Chapter 4 examines the technical quality of
NAGB decisionmaking, and factors contributing to technical quality, in
response to the third question. Each chapter ends with recommendations.



15 U.S. General Accounting Office, National Assessment Technical Quality, GAO/PEMD-92-22R
(Washington, D.C.: March 1992).



Chapter 2

Evaluation of NAGB's Approach



We were asked to evaluate the strengths and weaknesses of NAGB's
approach to setting achievement standards and measuring how well
students reached them on the 1990 NAEP mathematics test. NAGB's
measurement effort included defining three achievement levels, using item
judgment procedures to select a NAEP score for each level, and interpreting
these NAEP scores in terms of the achievement level definitions and of
statements of what students should know and be able to do. 

Our evaluation began by examining the concepts of basic, proficient, and 
advanced achievement as NAGB defined them. NAGB's definitions are
judgmental standards of what students at each level should know and be
able to do. The definitions both guide the selection of a NAEP score for
each level and provide the basis for interpreting student performance.
Thus, it is critical to determine how well the concepts NAGB has defined
correspond to what is measured by the NAEP scale.

Next, we focused on NAGB's item judgment and score selection procedures,
asking whether the practices NAGB followed in setting its 1990 standards
provided an adequate basis for judging how students at each level could be
expected to perform on the 1990 NAEP mathematics test. We then used test
score data and external evidence of student achievement to examine
whether the scores NAGB selected can validly be interpreted in terms of
NAGB's definitions and descriptions of what students at each level should
know and be able to do. Finally, we reviewed how NAGB presented its
results and whether it informed users of data limitations and cautioned
against potentially unwarranted interpretations of the NAEP scores, as is
appropriate whenever a new measure is introduced. 

This chapter presents our findings on each of these matters with respect to 
1990. Since NAGB changed its approach when it began further level-setting
work in 1992, we also reviewed the changes to see if any problems found
earlier were remedied. We conclude with recommendations for further
action by NAGB and NCES.



NAGB's Achievement
Level Definitions as
the Basis for NAEP
Score Selection and
Interpretation



Our first step in evaluating NAGB's approach was to examine how NAGB
defined the levels of achievement: to see what NAGB intended the
achievement levels to represent. We then reviewed the NAEP scale,
examined the achievement level definitions in light of what the NAEP scale
measures, and identified problems that might arise in finding NAEP scores
to match NAGB's definitions or, conversely, in using the definitions to
interpret performance at various NAEP scores.



NAGB's Definitions



NAGB's definitions (summarized in table 2.1) are complex. They incorporate
three aspects or dimensions of student achievement. First, the definitions
refer to overall performance, that is, to how much of the NAEP 4th, 8th, or 12th
grade mathematics test a student can answer. Second, the definitions
specify which types of items (what kind of material) the students at
different levels are expected to master. Finally, NAGB's definitions link the
achievement levels to predicted readiness for some future activity such as
entry-level college coursework or advanced technical training. The
standard-setting task was to find NAEP scores that would represent
appropriate overall performance for each level, appropriate mastery, and
appropriate readiness as well. This objective poses a challenge for the
NAEP scale, which is designed simply to represent overall performance.



Table 2.1: Dimensions of Achievement
in NAGB's Levels Definitions

NAGB level: Basic
  Overall performance on the NAEP test: Below proficient
  Mastery expectation: Partial mastery of fundamental skills; for 12th grade, refers to standard high school work
  Predicted readiness: Not specified

NAGB level: Proficient
  Overall performance on the NAEP test: Represents solid academic performance for the grade
  Mastery expectation: Basic-level mastery, plus competency over challenging subject matter for the grade
  Predicted readiness: Well prepared for the next level of schooling or (in the 12th grade) for adult life, work, and citizenship

NAGB level: Advanced
  Overall performance on the NAEP test: Superior performance, beyond proficient
  Mastery expectation: Not specified
  Predicted readiness: For 12th grade, ready for rigorous college courses, advanced technical training, or employment

Source: National Assessment Governing Board, The Levels of Mathematics Achievement , vol. 1, 
National and State Summaries (Washington, D.C.: 1991), p. 5. 



NAEP Scores and Overall
Performance



NAEP scores are a measure of overall performance on a test that covers a
wide range of material, from very easy items that nearly 100 percent of
students are able to answer to items that are so difficult that very few are
able to answer them. The scores reflect both the number and the difficulty
of the items answered correctly. They summarize how much of the test a
student can answer, giving more weight to difficult than to easy items.
(For more information on the NAEP test and scale, see appendix III.) Thus,
the NAEP scale can appropriately be used to set standards for how much
performance (weighted by difficulty) is enough: what we call overall
performance standards. (We discuss this type of standard in detail in
chapter 3.)



NAEP Scores and 
Expected Mastery 



The addition of expectations concerning student mastery, however, adds
complications. Here, the question of interest is how students perform on
certain types of items: for example, how students at the basic level
perform on the items that test fundamental skills. NAGB's approach places
considerable emphasis on identifying what students at each level should
know and be able to do and on finding a NAEP score to match. But the NAEP
scale measures (and NAGB's approach sets standards for) performance on
the test as a whole, not performance on any particular type of items or
class of skills. 1 Using the NAEP scale to express standards of what students
could do as well as how much they should be able to do, we thought,
raised important issues of procedure and interpretation.

The procedural issue was how to include all items in the item judgment 
process (as was necessary in order to use the NAEP scale) yet arrive at a
standard that reflected mastery of the items that represented content
pertinent to each level. The corresponding issue of interpretation was
whether it was valid to interpret the attainment of the NAEP score selected
for each level as evidence that the skills and knowledge expected at that
level have been mastered (and to infer that lower scores mean that these
skills have not been mastered). There is some question whether mastery
inferences can be drawn from the NAEP scale under any circumstances. 2 In
addition, we were uncertain whether NAGB's procedure, which uses the
percentage of items answered correctly to summarize the performance
expected at each level, would locate an appropriate score for each level on
the NAEP scale.



NAEP Scores and 
Predicted Readiness 



The definitions' predictions of readiness raise further issues of test
coverage and valid interpretation. First, NAEP tests are not intended to be
used for prediction. They were not designed with occupational skills or
advanced college course prerequisites in mind (the 1990 mathematics test
did not cover calculus, for example), so coverage of these skills or
prerequisites (unlike coverage of grade-level material) cannot be assumed.

1 If content mastery were the sole dimension of interest, the most obvious measurement approach
would be to identify items from the test that are relevant to each level, determine how many of those
items students should be able to answer correctly, and find the proportion of students who did so. We
discuss the application of content-based performance standards to NAEP in chapter 3.

2 See Robert A. Forsyth, "The NAEP Proficiency Scales: Do They Yield Valid Criterion-Referenced
Interpretations?" Educational Measurement: Issues and Practice, 10 (1991), 3-9 and 16.

Second, predictive standards cannot be based on judgment alone: they 
must be backed by factual information. Establishing a valid predictive 
standard generally requires identifying what skills and knowledge, at what 
level of difficulty, someone must master in order to be prepared for future 
success and making sure that the standard reflects performance on these 
relevant items. The alternative is to show that, content coverage aside, 
scores on the test actually predict success as claimed: for example, to 
demonstrate that, for whatever reason, a 1990 NAEP score of 330 (12th
grade proficient) is the dividing line between the 12th graders who 
succeed in freshman mathematics courses or on the job and those who do 
not. (We return to this point in our discussion of predictive validity.) 
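The kind of evidence described above can be sketched in a few lines. The following is a minimal illustration, not a procedure used by NAGB or in this report: it cross-tabulates whether students scored at or above a candidate cut score against a later outcome. The `Student` records, the scores, and the outcomes are all hypothetical values invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Student:
    score: float      # scale score on the test (hypothetical values)
    succeeded: bool   # later outcome, e.g. passed a freshman mathematics course


def classification_table(students, cut_score):
    """Cross-tabulate 'at or above the cut' against the later outcome."""
    table = {"above_success": 0, "above_fail": 0,
             "below_success": 0, "below_fail": 0}
    for s in students:
        side = "above" if s.score >= cut_score else "below"
        outcome = "success" if s.succeeded else "fail"
        table[f"{side}_{outcome}"] += 1
    return table


def agreement_rate(students, cut_score):
    """Share of students whose outcome matches what the cut score predicts."""
    t = classification_table(students, cut_score)
    return (t["above_success"] + t["below_fail"]) / len(students)


# Hypothetical data, invented purely for illustration.
students = [
    Student(300, False), Student(310, False), Student(315, True),
    Student(320, True), Student(325, False), Student(335, True),
    Student(340, True), Student(350, True),
]

table_330 = classification_table(students, 330)
rate_330 = agreement_rate(students, 330)
```

A cut score that truly divided successful from unsuccessful students would show few cases in the `above_fail` and `below_success` cells. In the made-up data here, two of the eight students succeed despite scoring below the cut.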



Summary



All in all, NAGB's definitions imply three different measurement purposes
that are difficult to reconcile and raise fundamental questions about
whether the achievement levels can be adequately measured by the NAEP
test and scale or can be validly used to interpret performance at various
test scores. We looked at NAGB's 1990 procedures (summarized in table
1.1) with these questions in mind.



The Adequacy of 
NAGB's Score 
Selection Procedures 



We examined NAGB's 1990 item judgment procedures to determine whether
panelists were given the resources they needed to do their job. These 
resources include a clear understanding of what students at each level 
should be able to do with respect to the material covered on the test and a 
basis for making informed judgments of how students at different levels 
are likely to perform on the various test items. Next, we reviewed the 
steps by which the item judgment results were transformed into a score on 
the NAEP scale.



NAGB's Item Judgment
Procedures



NAGB gave panelists its definitions of basic, proficient, and advanced
achievement and instructed them to make prescriptive judgments (to
state how students marginally qualified for each level should perform on
every item on the 1990 NAEP mathematics test), using these general
definitions as a guide. We concluded that this part of the approach had
two important weaknesses.

First, NAGB did not provide or ask panelists to develop a clear statement of
what students at each level should know and be able to do with respect to
4th, 8th, and 12th grade mathematics before starting the detailed work on
items (as is commonly recommended in the literature for the application
of Angoff item judgment methods). Instead, NAGB left it to each panelist to
formulate and apply his or her own working definition or standard of what
students at each level and grade should know and be able to do. 3 The
absence of common prior standards makes it difficult to trace the
connection between NAGB's definitions and panelists' judgments and leaves
the item judgment results themselves as the only basis for inferring the
skills and knowledge expected of students at each level and at the
corresponding NAEP score.

The second difficulty with NAGB's procedure was that panelists' judgments
were not backed by sufficient information — information that would have 
helped them understand how students marginally qualified for each level 
would be likely to perform on different types of items. Most importantly, 
panelists were not assisted in reaching informed judgments concerning 
how students functioning at the basic level might actually perform on 
difficult items — items that such students should not (by definition) be 
expected to answer and would most likely guess at or leave unanswered. 
What percent correct judgment should be given to these difficult items? 
Inappropriately high percents correct would push the overall score higher 
with the result that it would represent performance beyond what is 
required in the definition. 

Prior experience with item judgment methods suggests that the solution to 
this problem is to have judges estimate how students marginally qualified 
for a given level will do on a question and help them by giving them data 
on how such students (students who meet the standard in question) 
actually perform on very difficult questions. If judgments are realistic in 
this sense, hard questions that do not belong in the expectations for a 
fundamental-skills level will be given a realistically low percent correct 
estimate and thus will have appropriately little effect on the overall score 
standard. 
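The arithmetic behind this point can be shown with a small sketch. It assumes the percent correct standard is the simple average of the item judgments, which is how the summary step is described later in this chapter. The ten-item test and every judgment value below are hypothetical.

```python
def percent_correct_standard(item_judgments):
    """Average the judged percent correct across all items."""
    return sum(item_judgments) / len(item_judgments)


# Hypothetical judgments for a 10-item test at a fundamental-skills level:
# six easy items and four hard items.
easy = [80, 80, 75, 75, 70, 70]
hard_realistic = [20, 20, 15, 15]   # close to what guessing alone might yield
hard_inflated = [50, 50, 45, 45]    # "should" judgments pushed upward

realistic_standard = percent_correct_standard(easy + hard_realistic)
inflated_standard = percent_correct_standard(easy + hard_inflated)
increase = inflated_standard - realistic_standard
```

In this made-up case the judgments on the easy items are identical in both runs, yet raising the judgments on only the four hard items lifts the overall standard from 52 to 64 percent correct.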

NAGB did not, however, direct panelists to make their judgments realistic or
provide the data needed to do so. Panelists did have data on the
percentage of all students who answered each item on the 1990 NAEP
mathematics test correctly, but those are not helpful in estimating how
basic-level students, for example, perform. 4 Also, panelists were not
assisted in understanding what percentage of students could get an item
right simply by guessing. 5 In the interest of setting ambitious standards,
panelists may have resolved any uncertainties by pushing up the percents
correct, and NAGB's procedures afforded little protection against such
unrealistically high judgments. The problem here is that item judgments
may have gone beyond the expectations stated in the definitions for the
basic and possibly for the proficient level.

3 NAGB has recognized that specification of the achievement levels in terms of mathematics was a
problem in 1990. In its 1992 procedure, panel members developed common working definitions
specific to the subject tested before beginning their item judgment work.

In summary, the item judges did not start their work with a common
framework identifying the mathematics skills and knowledge appropriate
to each level and had little basis for judging how students at each level
would most likely answer difficult NAEP items. These gaps in NAGB's
procedures left only one basis for the item judgments: individual panelists'
views. The question arises whether these individual views constituted a
reliable basis for setting a national percent correct standard for each
achievement level.



The Issue of Reliability



A standard based on item judgments is said to be reliable if there is
evidence that (1) individual panelists were consistent in the judgments 
they made and (2) the views represented among the panel were reasonably 
representative of those found among qualified judges generally, such that 
the average of panelists' ratings is a trustworthy estimate of a more 
general standard. The reliability of the standard (its freedom from 
measurement error because of the composition of a particular panel) is 
typically assessed by examining the degree of variability in the judgments 
expressed. 
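One simple way to look at such variability is sketched below. This is an illustration under stated assumptions, not the analysis in NAGB's technical report: each panelist is assumed to contribute one percent correct standard, the standard error of the mean is used as a rough indicator of how much the panel average might shift with a different panel, and the panel names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def panel_summary(panelist_standards):
    """Mean, standard deviation, and standard error of the panelists' standards."""
    m = mean(panelist_standards)
    sd = stdev(panelist_standards)
    se = sd / sqrt(len(panelist_standards))
    return m, sd, se


def spread_of_panel_means(panels):
    """Range between the highest and lowest panel mean."""
    means = [mean(values) for values in panels.values()]
    return max(means) - min(means)


# Hypothetical percent correct standards from four regional panels.
panels = {
    "panel_1": [44, 48, 52],
    "panel_2": [50, 54, 58],
    "panel_3": [40, 46, 52],
    "panel_4": [47, 49, 51],
}

m1, sd1, se1 = panel_summary(panels["panel_1"])
spread = spread_of_panel_means(panels)
```

A large spread between panel means, relative to the variation within panels, would be one sign that the resulting standard depends heavily on which panel happened to set it.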

The design of NAGB's 1990 levels-setting procedure did not permit a full
examination of the reliability of the judgment data. However, it was
possible to examine the consistency of means across the four regional
panels whose judgments formed the basis for the standards. NAGB's
technical report on the 1990 achievement levels acknowledges that there
was "substantial and troublesome variability" in estimates of the basic
level of achievement across these four panels. This variability could have
stemmed from real regional differences in standards, but it could also have
been the result of the procedural weaknesses we have discussed. (NAGB's
report recommended that the sources of measurement error be evaluated
in future applications of the approach.)

Whether panelists' judgments were well informed or not, NAGB's
procedures needed to identify the point on the NAEP scale that matched
those judgments. We now examine how this was done.

4 NAGB provided panelists at the first item judgment meeting with visual displays that show how the
probability of getting an item correct changes as NAEP scores increase, but these displays were too
technical to be useful. Subsequent panels in 1990 and 1992 received only overall performance data.

5 Panelists apparently made their own individual estimates of the results of student guessing. NAGB
recognized that this might be a problem, especially if the panelists' methods were different from those
used in forming the NAEP scale, but did not see how it could be solved.



Transforming Item
Judgment Results Into a
NAEP Score



To transform the item judgments into a NAEP score, NAGB summarized the
item judgments by calculating the percent correct for each level across all
items and then found a matching percent correct point on the NAEP scale.

To illustrate, item judgment panelists set 48 percent correct as the
standard for the 8th grade basic level, and 48 percent corresponds to a
score of 255 on the NAEP scale. All the scores chosen as standards, we
found, represented the percent correct specified by the judges. However,
we questioned whether these scores represented the performance
expected for each achievement level for two reasons:

1. The percent correct summarizes how many items students should be 
able to answer, without regard to whether the items are easy or difficult. 
Thus, it loses information about the expected pattern of performance on 
items of varying difficulty — information that is important to the distinction 
between the achievement levels. 6 



2. The NAEP scale is not a simple measure of the percent of items answered
correctly. Rather, the scale score reflects the pattern of performance on
easier and more difficult items. To illustrate, 48 percent correct achieved
by answering only the easier items (as might be expected of a basic-level
student) corresponds to a lower NAEP score than getting 48 percent correct
by answering both easy and difficult items. NAGB's procedure finds a
central point in this range of scores that represents 48 percent correct, but
this central point may not represent the pattern of performance
appropriate to the basic-level student.
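The second point can be illustrated with a toy item response model. The sketch below is not the NAEP scaling procedure. It assumes a two-parameter logistic model, a maximum-likelihood ability estimate found by grid search, and eight hypothetical items in which the hard items are given higher discrimination than the easy ones. Under a model in which all items discriminate equally, the two patterns below would receive the same estimate.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a right/wrong response pattern at ability theta."""
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total


def estimate_theta(items, responses):
    """Maximum-likelihood ability estimate by grid search on [-4, 4]."""
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical (discrimination, difficulty) pairs: four easy, four hard items.
items = [(0.8, -1.0)] * 4 + [(1.6, 1.0)] * 4

easy_only = [1, 1, 1, 1, 0, 0, 0, 0]   # 50 percent correct, easy items only
mixed = [1, 1, 0, 0, 1, 1, 0, 0]       # 50 percent correct, easy and hard

theta_easy_only = estimate_theta(items, easy_only)
theta_mixed = estimate_theta(items, mixed)
```

Both patterns have the same percent correct, but under these assumed item parameters the mixed pattern receives the higher ability estimate, so a single percent correct does not pin down a single point on such a scale.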

6 The rationale for using the percent correct mechanism is strongest when the test is designed to
measure performance at the standard being judged, that is, when most of the items on the test are
items that most students who have just reached the standard (but not those who fall below the
standard) will be able to answer. If a test contains large proportions of items that are either too easy or
too difficult with respect to the standard, the rationale for using the percent correct is weakened
considerably.



Taking these two problems into account, the question arises whether
NAGB's procedures identified the point on the NAEP scale at which
performance indeed matched the expectations for different levels of
mastery of different types of items expressed in the item judgments. If not,
it would be invalid to interpret the NAEP scores in terms of these
expectations as NAGB's approach seeks to do.



Summary

We found that NAGB's procedures did not resolve the issues of
measurement and interpretation raised by its definitions of the
achievement levels. To the contrary, our review of NAGB's procedures
reinforced our concern about these issues. We concluded that it was
important to look closely at the test scores that NAGB's approach selected
to see whether the inferences of overall performance, mastery, and
readiness implied by NAGB's definitions and achievement level descriptions
were in fact supported.



Validity of 
Interpretation: What 
Do NAGB's NAEP 
Score Standards 
Represent? 



Our validity review examined whether NAGB's results and the
interpretations given to them were consistent with other indicators of
basic, proficient, and advanced achievement. We found that the literature
on item judgments stresses the importance of conducting such a review as
part of the judgmental process of setting standards. NAGB's achievement
levels policy paper recommended that validity studies be conducted, but
no such studies were undertaken for 1990. 7 To conduct a preliminary
evaluation of validity, we checked NAGB's results against several readily
available indicators of overall performance, mastery, and readiness.



NAGB's Standards as Indicators of Overall Performance

The most straightforward interpretation of NAGB's results is that the score
for the advanced level represents superior performance, that the score for
the proficient level represents solid performance for the grade, and that
the score for the basic level represents performance that is something less
than solid. A glance at how American students performed on the 1990 NAEP
mathematics test (figure 2.1) confirms that the score for the advanced
level represents superior overall performance, one that few students even
in high-ability classes were able to reach. The score for the proficient level
represents above-average performance even for students in these classes.
Clearly, performance at this level can reasonably be said to be solid, and
the score for the basic level is substantially less than this, as by definition
it should be.



7 NAGB titled its spring 1991 item judgment procedures plan a "replication/validation study." However,
this plan simply provided for new item judgment panels: it did not include any assessment of the
validity of the interpretations given to the standards produced by these panels.

Figure 2.1: NAGB Achievement Levels and NAEP Score Distributions (Percentiles) in 8th Grade Mathematics, 1990

We also checked NAGB's results against data on the mathematics
achievement of 9-year-olds and 13-year-olds from the 1990-91 International
Assessment of Educational Progress (IAEP). The IAEP test was similar to
NAEP, although the content was adjusted to be reasonably representative of
curricula across the various participating nations. We identified the score
attained by the top 1 percent of U.S. 9-year-olds on the international test
(equivalent to the percentile of 4th graders that NAGB's approach classified
as advanced) and identified the proportion of students from other nations
who equalled or exceeded this score. The same procedure was applied to
the scores of 8th graders and 13-year-olds. (The IAEP report did not provide
sufficient detail to allow similar comparisons at the basic and proficient
levels.)

Fewer than 5 percent of the 9-year-old students in any nation tested
demonstrated advanced achievement according to this comparison. For
13-year-olds, 10 percent of the students in Taiwan and at least 5 percent of
those in China (restricted sample) met this standard; in no other nation did
as many as 5 percent meet the advanced threshold. This comparison
indicates that NAGB's advanced level represents superior performance even
by world standards.

The question is, What more do these scores represent? Is it valid to infer
that they represent the points at which students have attained the skills
and readiness that NAGB's definitions and level descriptions imply? For
example, is it valid to infer that nearly 40 percent of all 8th graders, over
half of those in disadvantaged urban communities, and 10 percent of those
in classes that teachers identified as containing "high-ability" students
have not even partially mastered fundamental skills as figure 2.1 and
NAGB's definition of the basic level would suggest? To examine these
questions, we first examined how students at the scores NAGB selected for
each level actually performed.



The Mastery Interpretation of NAGB's Standards

To examine whether student performance at the NAEP scores NAGB selected
in 1990 matched the mastery expected for each level, we calculated the
percentage of students that judges thought should answer items of varying
difficulty and compared it to the answers that students at the NAEP score
for each level actually gave. First, using the final round of 1990 item
judgment data, we divided the 137 questions on the 1990 8th grade
mathematics test into four groups (the 35 easiest items, the 34 moderately
easy items, the 34 moderately difficult items, and the 34 most difficult
items) and then calculated the percentage of students who should answer
items in each group correctly at each achievement level. This gave us the
pattern of performance on different groups of items that judges expected
at each level. 8 (For further information on the performance patterns, see
appendix VI.)
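
The grouping step can be sketched in Python as follows. This is only an
illustration under stated assumptions: the item identifiers and all percent
correct values are randomly generated placeholders, not the 1990 data, and
the function names are hypothetical. The sketch sorts 137 items from easiest
to most difficult by the percentage of students who answered them
correctly, splits them into four groups of 35, 34, 34, and 34 items, and
averages the judged percent correct within each group to obtain an
expected performance pattern for one level.

```python
import random


def split_by_difficulty(pct_correct_by_item, n_groups=4):
    """Order items from easiest (highest percent correct) to hardest and
    cut the list into n_groups nearly equal groups, larger groups first."""
    ordered = sorted(pct_correct_by_item, key=pct_correct_by_item.get,
                     reverse=True)
    base, extra = divmod(len(ordered), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(ordered[start:start + size])
        start += size
    return groups


def expected_pct_by_group(groups, judged_pct_by_item):
    """Mean judged percent correct within each difficulty group."""
    return [sum(judged_pct_by_item[item] for item in group) / len(group)
            for group in groups]


# Placeholder data: 137 hypothetical items with made-up percentages.
rng = random.Random(0)
observed = {f"item{i:03d}": rng.uniform(10, 95) for i in range(137)}
judged_basic = {item: min(100.0, max(0.0, pct - 10 + rng.uniform(-5, 5)))
                for item, pct in observed.items()}

groups = split_by_difficulty(observed)
sizes = [len(group) for group in groups]
basic_pattern = expected_pct_by_group(groups, judged_basic)

print("group sizes:", sizes)
print("judged percent correct by group:",
      [round(value) for value in basic_pattern])
```

With real data, the same pattern would be computed once per achievement
level and then set beside the percentages that students at each NAEP score
actually achieved on the same groups of items.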

Next, we estimated the percentage of students at each level (that is, at the
NAEP scores of 255, 295, and 336) who actually did answer items in each
group correctly. 9 We then examined whether student performance at each
NAEP score matched panelists' expectations with respect to items that they
should have mastered (that is, items that 80 percent or nearly 80 percent of
students should have been able to answer).



8 We used judgment data from the final 1990 round of item judgments, and we used 8th grade as our
example because actual percent correct data by achievement level were available for this grade but
not for the others. We calculated the performance patterns for the 4th grade based on item judgments
also and found them to be very similar to the 8th grade patterns.

9 We used actual percent correct data for the 61 items made public from the 1990 test. There were at
least 13 such items in each difficulty group, and the percent correct figures by item group for these
items were comparable to those for the entire item set. We divided the items into groups based on data
showing the percentage of all students (out of those who attempted an answer) who actually got each
of the 137 items correct.

Our results are shown in table 2.2. With respect to items that should have
been mastered, the comparison shows that at the basic-level NAEP score of
255, students did considerably better than they should have according to
NAGB's judgments on the items relevant to that level. Students at the
proficient-level score of 295 also performed better than they should have
on relevant items, although the difference is not great. Students at these
levels fell below panelists' expectations regarding performance on items
that they were not expected to master. For example, only 16 percent of
basic-level students answered the most difficult items correctly, compared
to the item judgment expectations of 26 percent. For the proficient level,
the figures were 38 percent compared to 56 percent. These data suggest
that judgments with respect to these items were unrealistically high and
that their inclusion pushed up the standards for these levels.



Table 2.2: Panelists' Judgments Compared to Actual Performance on
Items Relevant to Each Achievement Level

                                            Percent correct
Achievement level and item group     Item judgment     1990 test
                                     expectations      results

Basic level (255)
  Easy items                             67%              83%
  Moderately easy items                  50               60

Proficient level (295)
  Easy items                             85               93
  Moderately easy items                  75               84
  Moderately difficult items             67               67

Advanced level (336)
  Easy items                             96               96
  Moderately easy items                  92               95
  Moderately difficult items             88               90
  Most difficult items                   80               71

Source: National Assessment Governing Board, The Levels of Mathematics Achievement
(Washington, D.C.: 1991), vol. 3, Technical Report, pp. 265-71 (item judgment data), and vol. 2,
State Results for Released Items, pp. 3-35 (percent correct data).
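
As a reading aid, the comparison in table 2.2 can be reduced to a gap per
row, actual minus judged percent correct. The short Python sketch below
does this using the table's figures as transcribed here; the data structure
and function name are illustrative choices, not part of the original
analysis.

```python
# (level, NAEP score) -> {item group: (judged percent, actual percent)}
# Values transcribed from table 2.2.
TABLE_2_2 = {
    ("Basic", 255): {
        "Easy": (67, 83),
        "Moderately easy": (50, 60),
    },
    ("Proficient", 295): {
        "Easy": (85, 93),
        "Moderately easy": (75, 84),
        "Moderately difficult": (67, 67),
    },
    ("Advanced", 336): {
        "Easy": (96, 96),
        "Moderately easy": (92, 95),
        "Moderately difficult": (88, 90),
        "Most difficult": (80, 71),
    },
}


def performance_gaps(table):
    """Actual minus judged percent correct for every level and item group."""
    gaps = {}
    for (level, _score), groups in table.items():
        for group, (judged, actual) in groups.items():
            gaps[(level, group)] = actual - judged
    return gaps


gaps = performance_gaps(TABLE_2_2)
# Rows where students at the cut score fell short of the judges' expectation.
shortfalls = [key for key, gap in gaps.items() if gap < 0]

for (level, group), gap in gaps.items():
    print(f"{level:<11}{group:<22}{gap:+d}")
print("shortfalls:", shortfalls)
```

A positive gap means students at the cut score outperformed the judges'
expectation on items relevant to that level; the only negative gap among
the relevant items is for the most difficult items at the advanced level.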



This evidence suggests that while students at the basic and proficient
score standards can be said to have met the mastery expectations for their
level, students at lower scores may have matched these expectations as
well. For the basic level, the matching score could have been considerably
lower (around 240-42); at this score, the basic standard would have been
met by a considerably larger percentage of students than NAGB reported
(75 percent rather than 62 percent).

In contrast, students at the advanced level did less well than they should
have according to NAGB's judges on the most difficult items. However, they
achieved the required overall percent correct by performing better than
expected (indeed, nearly flawlessly) on the easier but less relevant three
quarters of the test. This again suggests caution against invalid inference:
overall performance on a test that consists largely of standard grade-level
materials does not necessarily imply mastery of items that are "advanced"
in the sense that they require complex thinking.



The Predictive Validity of 
the 12th Grade Standards 



The NAGB definition of proficient achievement at the 12th grade level states
that students at or above that level are prepared for college study and for
productive work. We examined information relevant to these predictions. 10

We found that NAGB's paragraph describing what proficient 12th graders
should know (see appendix V) was generally consistent with the College
Board's summary of what students need to know to undertake
postsecondary study in mathematics. 11 As noted in chapter 1, NAGB found
that only 16 percent of 12th graders had met or exceeded the proficient
level and thus should be considered ready for college work. We calculated
that this 16 percent represents about 27 percent of the students who enroll
directly in higher education after high school. 12 If only 27 percent of
freshmen are prepared for college study in mathematics, then 73 percent
are not prepared. The American Council on Education reported, in
contrast, that 29 percent of incoming freshmen in 1991 identified
themselves as needing remedial work in mathematics. 13 Even allowing for
the fact that not all students may recognize their need for remedial work,
the gap again raises questions about interpreting the achievement level
scores in terms of NAGB's definitions.
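
The arithmetic behind these percentages can be restated in a few lines of
Python. The figures are those given in the text and its notes (16 percent
at or above proficient, 60 percent of graduates enrolling in higher
education, 29 percent self-reporting a need for remedial work); the
variable names are illustrative, and the calculation rests on the stated
assumption that virtually all proficient students are college bound.

```python
# Figures taken from the text and its notes.
proficient_pct_of_12th_graders = 16   # at or above the proficient level
college_enrollment_pct = 60           # graduates enrolling the following fall
self_reported_remedial_pct = 29       # freshmen reporting a need for remediation

# Assumes virtually all proficient students are in the college-bound group.
prepared_pct_of_freshmen = round(
    100 * proficient_pct_of_12th_graders / college_enrollment_pct)
unprepared_pct_of_freshmen = 100 - prepared_pct_of_freshmen

# Difference between the share implied to be unprepared and the share
# who describe themselves as needing remedial work.
gap_in_points = unprepared_pct_of_freshmen - self_reported_remedial_pct

print(prepared_pct_of_freshmen, unprepared_pct_of_freshmen, gap_in_points)
```

The size of the difference between the implied and the self-reported
figures is what prompts the question raised in the paragraph above.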

To evaluate the reasonableness of the proficient standard with respect to
preparation for productive work, we compared NAGB's description of 12th
grade proficient performance to findings from the Secretary's Commission
on Achieving Necessary Skills (SCANS Commission), to findings from a 1990
New York State field study of the mathematics skills requirements of a
wide range of jobs that do not require a 4-year college degree, and to the
skills covered in certification tests for six such occupations. 14 Our review
suggested that the skills NAGB described in connection with the basic level
would be sufficient for productive employment in most jobs that do not
require a college degree. The materials we reviewed indicate that while
many such jobs require the application of modest levels of knowledge of
arithmetic, measurement, and probability, very few use even the simplest
algebra or geometry. 15

10 The definition also predicts readiness for democratic citizenship and for responsible adulthood. We
did not examine these predictions.

11 College Board, Academic Preparation for College (New York, N.Y.: 1983).

12 The Bureau of the Census has estimated that 60 percent of 1990 high school graduates enrolled in
higher education the following fall. (This figure is reported in U.S. Department of Education, National
Center for Education Statistics, The Condition of Education, 1992 (Washington D.C.: 1992), p. 28.) We
assume that virtually all the students who score in the top 16 percent on NAEP are part of the
college-bound group. Sixteen percent of the total 12th grade population is equivalent to 27 percent
(16/60) of the college-bound population.

13 A. W. Astin et al., The American Freshman: National Norms for Fall 1991 (Los Angeles: Higher
Education Research Institute, UCLA, 1991), p. 14.

With respect to readiness for rigorous college study, we compared NAGB's
finding that 2.6 percent of 12th grade students reached the advanced level
with the report that 1.5 percent of U.S. 11th and 12th graders took
advanced placement exams in calculus in 1991 and 1 percent scored well
enough to be deemed eligible to receive college credit. This suggests that
NAGB's standard identifies a percentage of the student population roughly
comparable to the percentage that are not only ready for college study but
have already undertaken it while in high school. The 1990 NAEP test did not
cover calculus, and there is no way of knowing whether the two
procedures identify the same segment of the student population.



Conclusion: Validity of 
NAGB's Score 
Interpretations 



We conclude that the NAEP scores selected through NAGB's procedures are
incomplete and somewhat misleading representations of the achievement
levels. Student performance at the NAEP scores selected cannot validly be
interpreted in terms of NAGB's definitions or of the item judgments. The
NAEP scores cannot be used to find the percentage of students who have
met the content mastery and readiness criteria NAGB defined for each level.



Flaws in procedure, such as the lack of information support to panelists
(especially of level-specific performance data) and possibly the use of the
percent correct mechanism to translate item judgment expectations into a
NAEP score, contributed to the lack of correspondence between the scores
NAGB selected and the interpretations offered for them. Failure to validate
whether the NAEP scores selected matched the expectations for mastery
expressed in the item judgments or in the level definitions constituted a
further important flaw.

14 U.S. Department of Labor, Secretary's Commission on Achieving Necessary Skills, Skills and Tasks
for Jobs (Washington, D.C.: U.S. Government Printing Office, 1992), and New York State Education
Department, "Report to the Board of Regents on Career Preparation Validation Study," Albany, N.Y.,
n.d. We reviewed the occupational tests administered by ETS for cosmetology, production and
inventory control, cash management, construction code inspection, construction supervision, and food
protection inspection.

15 The argument can be made that the high-performance workplace of the future will require workers to
have more extensive quantitative knowledge than do most current jobs. But NAGB's readiness criteria
are stated in terms of readiness for any productive work. The proficient level is supposed to represent
what all students need, not just what students need to enter a field that relies on mathematical skills.

Most important of all, and contributing to the fundamental measurement
problems inherent in NAGB's approach, is the critical conceptual issue
raised at the beginning of this chapter, the issue of whether the mastery
dimension of NAGB's achievement levels can adequately be represented
using the NAEP test and scale, which were designed to depict overall
performance. Whereas the procedural flaws suggest merely that NAGB's
approach was poorly executed, the conceptual issue speaks to the
approach itself: our data suggest that it was invalid for the purpose of
drawing inferences about content mastery. We will return to this issue in
chapter 3.



NAGB's Presentation
of the Results

NAGB's aim in establishing achievement levels was to make NAEP scores
more interpretable, to enable readers to compare current performance
against a standard of what students should know and be able to do. As we
have seen, however, NAGB's interpretations of the achievement levels
(which confound what panelists thought students should do with what
they actually could do) have been misleading.

NAGB described its procedure as the application of standards that represent
broad consensus on what students should know, which is also misleading.
This language creates the impression that the achievement level
descriptions represent general content mastery standards (standards that
were developed independently of, and subsequently applied to, NAEP data). 16
In fact, the descriptions were a product of (not a guide to) the
standard-setting process; they represent test items that met certain narrow
selection criteria.

NAGB surveyed various user groups to ascertain whether they found the
achievement levels useful, and most did. Many user representatives
responded that the levels report had clearly conveyed the significance of
student performance, but others found the report unclear. We note that
respondents did not have information that would have alerted them to
weaknesses in NAGB's approach and data, nor were efforts made to explain
and rule out competing plausible explanations of NAGB's findings of poor
student performance. Indeed, the survey did not ascertain whether
respondents had interpreted the data accurately.

16 For example, on the basis of the description of the levels in the National Goals Report, officials of the
U.S. Metric Association took NAGB's levels descriptions to be general standards and to imply that only
advanced 8th grade students are expected to understand metric measurements. They expressed
concern to GAO. When we explained the very restricted bases of the levels descriptions, these officials
commented that referring to these descriptions as standards of what students should know was very
misleading.

NAGB was aware of some of the limitations of the levels data when it
published the 1990 levels results in September 1991 but neither cautioned
readers concerning the reliability of the data nor noted that validity had
not been established. The technical report that appeared late in
November did offer cautions on reliability grounds. NAGB itself was
unaware of other significant limitations, largely because a thorough review
of its basic approach had not been conducted. (Such a review is required
in preparation of all NAEP reports, which are published through NCES, but
not for reports published under NAGB's independent authority. NCES
officials have told us that the 1990 levels results probably would not have
passed this review.)

Moreover, although NAGB's achievement levels report did describe its 1990
standard-setting as a trial effort that was based on a process that "while
imperfect, was serviceable," it presented the results of that process as
"revealing and diagnostic" and urged those seeking to make change to take
action based on NAGB's standards and data. 17 Such advice was premature.



The 1992 Levels Procedures

NAGB was aware of some of the flaws in its procedures and acted to correct
them for 1992. It secured the services of a contractor, ACT, that is
experienced in standard-setting and in the use of item judgment methods
and has access to considerable technical expertise. ACT has proposed and
implemented improvements that have strengthened the item judgment
procedure. These include careful attention to panelist selection, improved
training, the development of guiding definitions for each subject and grade
prior to beginning the item judgment process, and review of the reliability
of the judgment results. However, critical weaknesses inherent in NAGB's
overall approach remain unaddressed, as indicated in table 2.3.

17 National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 1, National and
State Summaries (Washington, D.C.: 1991), pp. 11 and ix.

Table 2.3: 1990 Weaknesses and 1992 Procedures

1990 weakness                                 1992 procedure

1. Levels definitions incorporate             Original definitions not changed;
   dimensions ill-suited to the NAEP          panelists developed a working
   scale; no clear standard for               standard for each level and grade
   performance in mathematics                 by applying the definition to the
                                              NAEP mathematics test framework

2. Standard for each level is based on        Unchanged
   both relevant and irrelevant items

3. No level-specific performance data to      Unchanged
   enable panelists to estimate
   realistically

4. Reliability of item judgments not          Reliability being examined;
   fully assessed                             results not yet available

5. Percent correct score used to              Unchanged. Use of a pattern-based
   summarize item ratings, assumed to         method has been found to change
   capture pattern                            scores only slightly

6. No examination of validity of              The question of inference being
   inferences made from NAEP scores           examined as part of the NCES
                                              prepublication review process



In addition, the 1992 procedures pose new issues and problems. The
working definitions created by the item judgment panels (item 1 in table
2.3) provide the specific standards for performance in mathematics that
were missing from the 1990 approach. NAGB plans to use these working
definitions of what students should be able to do to interpret student
performance at each NAEP score selected as a standard. These include a
working definition that makes clear that the advanced level represents
complex understanding of mathematics, not simply stronger performance
on a wide range of problems. As we have shown, however, NAGB's
procedure cannot be assumed to select a NAEP score at which students'
mastery matches judges' expectations. Unless a good match can be
demonstrated, the working definitions cannot be used to interpret
performance at the score selected.

Finally, NAGB's standard-setting procedure was applied to reading and
writing for the first time in 1992. The reading and writing assessments
made much more use of extended-response questions than the
mathematics exam. The item rating procedure, which is easy to apply to
multiple-choice questions, must be adapted for use with
extended-response questions. ACT pilot-tested procedures for judging such
items and concluded that they were feasible, but it is too early to say
whether the actual judgment panels were successful. New procedures and
data sources may be required to check validity for the reading and writing
standards.

NAGB's publication policy specifies that the achievement levels were to be
the basis for reporting NAEP results for 1992, and NAGB has directed NCES to
instruct ETS to use the achievement level NAEP scores as benchmarks for
reporting student achievement and to offer the judgment panelists'
working definitions as interpretations of performance at each of these
scores. NAGB's policy also specifies that all NAEP reports must meet NCES
technical quality standards, which are similar to those applied in this
report. In light of the many questions we and others have raised about
NAGB's approach, the commissioner has several times urged that NAGB
continue to use the old method of reporting NAEP results until the
achievement level approach can be shown to be sound. On each occasion,
NAGB has reaffirmed its commitment to the use of the achievement levels.



Conclusions: the 1992 
Approach 



We conclude that while ACT's 1992 procedures have addressed some of the
problems that affected the 1990 standard-setting, the fundamental problem
of finding a test score that can validly be interpreted in terms of NAGB's
definitions and descriptions remains unaddressed. If anything, the gap
between the level definitions, the achievement level descriptions, and
actual performance at the NAEP score selected for each level is likely to be
greater than before. Unless and until NAGB can show that its approach is
internally consistent and produces valid interpretations of the NAEP scores
selected to represent each level, it should either refrain from reporting in
terms of the achievement levels at all or present the levels scores simply
as NAGB's judgmental standards for partial, solid, and superior
performance, without further interpretation. 18



Overall Conclusion

As NAGB noted in its 1990 policy on setting achievement levels, its
approach was an experiment. NAGB concluded that its experiment was
sufficiently successful to warrant the continuation of the approach, with
procedural improvements. However, it did not analyze the effects of the
modifications it made in the Angoff process, nor did it examine whether
the scores it selected could be validly interpreted in terms of the levels
definitions. Having done so, we reach a different conclusion.



18 The 1992 NAEP results for mathematics were published in April 1993. The achievement levels did
undergo NCES prepublication review. NCES accepted NAGB's levels-setting process and the resulting
scores as given and focused on the question of valid inference, that is, on whether NAGB's statements of
what students at these scores should do are useful for interpreting what such students actually can do.
Early results from technical evaluations suggested a need for further examination of the inferences
that can be drawn from NAEP scores. Both the achievement level approach and the conventional
anchor point approach are now under review. The 1992 NAEP report presents data both ways, alerts
readers to the questions that have been raised, and urges readers to assess for themselves how well
the various forms of reporting meet their needs.

We conclude that NAGB's 1990 approach was inherently flawed, both
conceptually and procedurally, and that the evaluation team's advice (that
the approach not be used further until a thorough review could be
completed) was warranted. NAGB's approach identified scores that
represent different levels of overall performance, but these scores are not
necessarily evidence that students have the skills and knowledge specified
for that level. NAGB's approach was not changed in fundamental respects
for 1992 and is likely to produce unsupported and quite possibly erroneous
interpretations with respect to the 1992 NAEP tests also.

These weaknesses are not trivial; reliance on NAGB's results could have
serious consequences. For example, policymakers might conclude that
since nearly 40 percent of 8th grade students did not reach the basic level
(a NAEP score of 255), resources should be reallocated so as to emphasize
fundamental skills for most classes. Since many students who scored
below 255 were in fact able to answer basic-level items (according to our
analysis), this strategy could retard their progress toward mastering more
challenging material. Similarly, parents or educators might conclude that
NAGB's description of the advanced 4th grade level (which simply
summarizes certain difficult test items) represents the mathematics skills
that should be taught to gifted children and focus their curriculum
accordingly. Other testing entities might adopt NAGB's procedures, on the
understanding that they produce valid and useful results. And finally, NAEP
might abandon its existing straightforward empirical basis for score
interpretation in favor of one that is unrelated to actual performance.



In light of the many problems we found with NAGB's approach, we
recommend that NAGB withdraw its direction to NCES that the 1992 NAEP
results be published primarily in terms of levels. The conventional
approach to score interpretation should be retained until an alternative
has been shown to be sound.

Second, we recommend that the Chairman of NAGB and the Commissioner
of Education Statistics develop a joint plan and schedule for a review of
NAGB's achievement levels approach (its definitions of achievement, score
selection procedures, and score interpretation), taking into account
evaluations that are currently under way and providing for additional
activities as needed. The plan should begin with a review of existing
critiques of the approach and should include, at an early stage, a
determination by the commissioner whether (1) NAGB's approach will
necessarily produce invalid interpretations of NAEP scores and should not
be pursued or (2) the approach is sufficiently promising that a specific
plan for preparing for NCES prepublication review should be designed and
implemented. 19 If option 1 is selected, the case is closed. If the decision is
to proceed, NAGB should develop evidence that the levels results are valid
and reliable and that the interpretations suggested for them are supported.
NCES should make clear what evidence will be required.



19 As discussed in detail in chapter 4, the commissioner is responsible for ensuring that NAEP depicts
achievement fairly and accurately and for conducting reviews and validation studies of the
assessment.

The second study question asked us whether NAGB's approach is suited for
use with NAEP and whether alternative approaches might provide better
standards for goal achievement. In chapter 2, we concluded that NAGB's
achievement levels approach, which sets standards for overall
performance on NAEP but interprets them in terms of what students at each
performance level should have mastered, is unworkable. In this chapter,
we do not discuss NAGB's approach further. Instead, we consider the
general question of how NAEP might be used to set goals, benchmarks, or
standards. Our analysis distinguishes between overall performance
standards and content-based performance standards. We begin this
chapter with a general discussion of these two types of standards and then
evaluate alternative approaches to each one. The chapter ends with
recommendations.



Two Types of
Performance
Standards



Student performance standards can take either one of two forms. Overall 
performance standards identify how much performance is enough and are 
expressed in terms of a total score on a test of knowledge that is generally 
relevant to the standard (that is, a test that focuses on the material that 
students are expected to know at the levels of difficulty they are expected 
to be able to handle). Overall performance standards are typically used to 
determine whether a student knows enough to be considered qualified for 
some specific purpose such as high school graduation or professional 
certification. This type of standard may also be used to group scores 
according to descriptive categories such as unqualified, marginally 
qualified, and fully qualified. The standard is generally measured in terms 
of the number or proportion of questions answered correctly. Any pattern 
of right answers that yields a score at or above the overall performance 
score standard is acceptable, since all items on the test are deemed 
relevant to the standard. 



Content-based performance standards, in contrast, indicate that a student 
has mastered specified subsets of content at an acceptable level. Only 
performance on items whose content is pertinent to the standard is taken 
into account in setting the standard and in measuring student performance 
in terms of the standard. If a test covers several content areas (such as 
algebra, geometry, and statistics), each of which is the subject of a 
standard, items are sorted by content area and a separate scale is formed 
and a separate standard is set for each area. 

The two standards serve different purposes, provide different kinds of 
information, and require different properties of the test on which they are 
based. Differences between the two types of standards are summarized in 
table 3.1. Our analysis focuses on whether NAEP tests are adequate to 
support each of the potential alternative methods we have identified, how 
each method might be applied, and how its results can be interpreted. 



Table 3.1: Overall Performance Standards and Content-Based 
Performance Standards 

Purpose 
  Overall performance standards: To group test scores into categories 
  that represent levels of performance or qualification for some stated 
  purpose. Key question: Have students learned enough? 
  Content-based performance standards: To identify the performance that 
  signifies achievement of a content mastery standard. Key question: 
  Have students learned what they should have learned? 

Prerequisite 
  Overall performance standards: Agreement on what constitutes the 
  standard. Standards may take the form of ordered descriptive 
  categories such as acceptable and outstanding or may be defined in 
  terms of the proficiency needed to perform successfully in some 
  current or future status. 
  Content-based performance standards: Content standards (statement of 
  the knowledge and skills that students are expected to master). 

Performance standard 
  Overall performance standards: Score on the test as a whole. 
  Content-based performance standards: Score based on items pertinent 
  to each content standard. 

Test requirements 
  Overall performance standards: Test must cover the knowledge areas 
  specified at an appropriate level of difficulty. 
  Content-based performance standards: Test must be aligned to the 
  content standards and must contain sufficient items pertinent to each 
  standard to support accurate measurement. 

Interpretation 
  Overall performance standards: Someone who scores at or above the 
  performance standard is said to know enough to meet the standard. 
  Content-based performance standards: Someone who scores at or above 
  each performance standard is said to have mastered the content 
  addressed in that standard. 



The choice of method must ultimately be guided by the choice of purpose. 
If the purpose of setting national standards for student performance is to 
determine whether students have reached an acceptable overall level of 
mastery of grade-level material, overall performance standards are 
appropriate. Overall performance standards, however, do not indicate 
whether students have mastered the skills and knowledge specified in 
national content standards. Content-based performance standards serve 
that purpose. 



Setting Overall 
Performance 
Standards for NAEP 



NAEP tests are designed to cover knowledge generally relevant for a given 
subject and grade and, thus, provide a potentially suitable basis for 
applying standards of overall performance on grade-level material. Since 
NAEP tests are designed to be most accurate in the range of performance 
typical of the majority of students, they are potentially well suited for 
expressing overall performance standards applicable to the majority of 
students. As suggested in table 3.1, it is important to identify key 
knowledge areas and the level of difficulty that students should be 
expected to handle in each area and to design NAEP tests accordingly. 1 



NAEP can be used to set standards for levels of performance that represent 
a challenge to today's average student. However, NAEP is likely to provide 
insufficient data to support accurate measurement at very high levels of 
performance that very few students reach. Care must be taken to ensure 
that criteria for and interpretations of high overall performance are 
expressed in terms of consistent mastery of the various grade-level 
materials covered on the test, rather than mastery of the most difficult 
items specifically. 



We examine three methods through which overall performance standards 
for NAEP might be set: methods based on current performance, on criterion 
performance, and on test items. We found that while these methods can be 
used singly, they are commonly used in combination. 



Standards Based on 
Current Performance 

Overall performance standards can be selected by examining how various 
categories of students currently perform on the test, interpreting the 
overall capabilities represented by test scores at various points on the 
scale, and putting these two kinds of information together to select scores 
to be used as benchmarks or standards for future years. The selection of 
scores under this first method is truly a matter of informed judgment, 
reached by consensus. The method consists of defining categories of 
overall performance, assembling individuals representing diverse views of 
what students can and should achieve on a NAEP test, giving them data on 
actual performance and assistance in interpreting performance at various 
score levels, and asking them to select a score or ranges of scores to 
represent each overall performance category. 



1 In the past, NAEP tests have concentrated on materials that are covered in most classrooms, in effect 
using the existing common curriculum to define what students should learn. The common curriculum, 
however, does not necessarily represent expert and citizen consensus concerning what should be 
taught. To measure progress with reference to emerging standards as well as to past and current 
practice (as is NAGB's policy for future tests) requires striking a delicate balance in test design. 

In one variant of this approach, the basis for setting standards is 
performance-distribution data. Standard-setters examine the distribution 
of scores for the total sample and for subgroups of students and select 
total scores that appear to be suitable benchmarks for the various 
performance categories. (Categories can be given labels such as 
marginally qualified and fully qualified or can be expressed in terms of 
target percentiles, for example, as the score that at least 80 percent of 
students should achieve.) Patterns of item mastery associated with each 
score are examined to confirm that the score is appropriate in terms of the 
knowledge and skills expected for each category, and the potential 
benchmark is adjusted upward or downward until members of the panel 
reach consensus that a score that represents the performance category in 
question has been found. Examples based on NAEP might include selecting 
the score that represents the current 75th percentile for students in 
high-ability classes as the standard for "excellent" performance or using 
the 15th percentile score achieved by students in the top two thirds of 
schools as the 15th percentile standard for the nation as a whole. In a 
second variant, standards are set by identifying several test items that 
typify each level of proficiency to be established, determining the point on 
the overall performance scale at which most students are able to answer 
each typical item, and selecting a midpoint score or upper and lower 
boundary of each proficiency category based on these data. 
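
The first variant can be illustrated with a short calculation. The sketch 
below, in Python, is offered only as an illustration under stated 
assumptions: it uses a nearest-rank definition of a percentile (other 
definitions exist), the function name percentile_score is hypothetical, 
and the scores are invented rather than drawn from NAEP. 

```python
def percentile_score(scores, pct):
    """Return the nearest-rank percentile of a list of test scores.

    pct is a whole number from 1 to 100. The nearest-rank rule is an
    assumption made for this illustration; other definitions exist.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not isinstance(pct, int) or not 1 <= pct <= 100:
        raise ValueError("pct must be a whole number from 1 to 100")
    ordered = sorted(scores)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented scores for students in high-ability classes (not NAEP data).
high_ability_scores = [250, 260, 270, 280, 290, 300, 310, 320]

# Candidate benchmark for "excellent" performance: the 75th percentile
# of the high-ability group.
excellent_benchmark = percentile_score(high_ability_scores, 75)
```

A panel would treat a figure obtained this way only as a starting point, 
adjusting it upward or downward after reviewing the patterns of item 
mastery associated with it. 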

In either variant, the selection of performance-based standards is a matter 
of prescriptive judgment supported by data. (Judgments based on overall 
performance might consult any of the types of data we used in chapter 2 in 
examining the validity of NAGB's score interpretations. Judgments based on 
item mastery might seek evidence that the benchmark items are good 
indicators of mathematics proficiency, relatively unaffected by factors 
such as reading proficiency or familiarity with a particular item format.) 
The chief requirements for setting standards based on inspection of 
current performance are that the selection body's claim to represent broad 
consensus be credible and that its decisions be supported by logic and 
information. 

The methods just described can be used with existing NAEP tests and do 
not require that detailed performance criteria be spelled out in advance. 
Standards selected on the basis of actual performance (with attention to 
student diversity) send a message that what is accomplished in some 
schools should be expected of all. The chief drawbacks of 
performance-based methods are that judgments may be inadequately 
informed and (at the other extreme) that data-gathering may consume 
undue time and effort and bury decisionmakers in details. These 
drawbacks can be prevented by agreeing in advance what key viewpoints 
and what types of data will be considered and how diverse views will be 
taken into account and by providing decisionmakers with analytic support. 



Standards Based on 
Criterion Performance 

Overall performance standards may also be set by drawing on independent 
information concerning students' mastery of the material covered by the 
test. The general steps in this method are 

1. describe the characteristics of students at each performance category to 
be established — for example, what the marginally proficient, proficient, 
and exceptionally proficient student should be able to do with respect to 
the materials covered on the test; 

2. find individual students or groups of students who are independently 
judged to match these descriptions; 2 

3. ascertain how these individuals or groups perform on the NAEP test and 
what NAEP scores would be estimated based on their performance; and 

4. use these data to find a benchmark score or range of scores that 
represent qualified performance. 

The General Educational Development (GED) testing program provides an 
example of this type of procedure. The GED examination is a multisubject 
test battery covering core knowledge in high school subjects and is used 
to determine whether a student who did not complete high school is 
qualified to receive a high school equivalency certificate. The program 
uses graduation from high school as its criterion. It regularly administers 
its tests to samples of graduating seniors and, using these results, adjusts 
the minimum passing score on the GED examination so that it reflects the 
30th percentile: the score that divides the lower 30 percent of diploma 
holders from the upper 70 percent. 3 Similarly, the advanced placement 
program calibrates the scoring of advanced placement tests to the 
performance of students in college courses. 4 



2 It may also be useful to identify students who would definitely not be expected to meet the criterion, 
such as students enrolled in the next lower grade. 

3 The choice of 30th percentile represents a policy judgment by the GED program, which we present for 
illustrative purposes. Participating states are free to (and do) select higher figures, in line with 
expectations for and the performance of their own graduates. 

For some types of standards, criterion groups cannot be readily located. 
For example, there is no simple way to find a group of "fourth graders who 
have mastered the skills fundamental to challenging fourth-grade work" 
(the appropriate group for NAGB's "basic" level of achievement). Where 
there is no obvious criterion group, experts can be asked to identify 
individuals or test papers that exemplify the performance associated with 
the criterion. For example, teachers whose classes were selected for 
inclusion in the NAEP sample might be trained to understand NAEP 
performance categories and asked to identify and to enter a code on the 
test booklets of students whose classroom performance meets the criteria 
for each category. Alternatively, experts involved in the grading process 
could be asked to identify completed test papers that exemplify 
performance that meets those criteria. NAEP scores projected on the basis 
of the exemplary papers could then be examined. 

We conclude that criterion performance methods could be used with 
current NAEP tests. Criterion performance methods might require that tests 
be given to groups not included in the normal sample or that new 
procedures be instituted to identify exemplary papers within that sample. 

Criterion performance methods have an advantage in that they build in 
attention to empirical validity. The corresponding disadvantage of such 
methods is that they work well only when criterion performers can be 
readily identified. Given the imprecision of all tests, scores associated with 
criterion performers are likely to be somewhat spread out, and it may be 
difficult to find a single score that distinguishes qualified or exemplary 
students from others. Since NAEP scores are not being used to make 
decisions about individuals, however, there is no requirement that a 
benchmark selected to represent a category of performance be set at the 
bottom of the achievement range associated with that category. A 
benchmark that represents a central point in that range may be equally 
useful. 
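
The difference between a bottom-of-range benchmark and a central-point 
benchmark can be shown with a small calculation. The Python sketch below 
is illustrative only: the function name criterion_benchmarks is 
hypothetical, the use of the median as the "central point" is an 
assumption (a mean or another summary could serve), and the scores are 
invented rather than taken from any criterion group. 

```python
from statistics import median


def criterion_benchmarks(criterion_scores):
    """Summarize the scores earned by students judged to meet a criterion.

    Returns two candidate benchmarks: the lowest score observed in the
    group and the median score. Treating the median as the "central
    point" is an assumption made for this illustration.
    """
    if not criterion_scores:
        raise ValueError("criterion_scores must not be empty")
    return {
        "bottom_of_range": min(criterion_scores),
        "central_point": median(criterion_scores),
    }


# Invented scores for students independently judged to be proficient.
example = criterion_benchmarks([281, 295, 300, 310, 322])
```

Because criterion performers' scores are spread out, the two candidates 
can differ considerably; which one is reported is a matter of purpose 
rather than of computation. 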



Standards Based on Test 
Items 



In a third approach, overall performance standards can be established 
through judgments concerning performance on test items, as in the 
standard Angoff method. The province of Alberta, Canada, for example, 
identifies standards for excellent performance on provincial assessments 
for elementary school subjects (expected of the top 15 percent of 
students) and adequate performance (expected of at least 85 percent of 
students) by applying item judgment methods to the test as a whole. 5 



4 Decisions as to whether to grant college credit are made by colleges, which can and do set standards 
that exceed those suggested by the testing program. 

We find that NAEP tests could form the basis for applying overall 
performance standards through Angoff item judgment procedures, 
provided that the standards were not aimed at extremes of achievement 
(where NAEP is less accurate) and that the criteria for sound 
implementation of judgment procedures were met. Criteria for sound 
implementation of item judgments include (1) clear, consensus-based 
specification of what students at each level should be able to do on the full 
range of items on the test; (2) panelists who have the expertise and 
information to make accurate judgments of how students who actually 
have those capabilities will perform on each test item; and (3) review of 
student performance at the NAEP score selected to confirm that it matches 
panelists' expectations. 
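
The arithmetic at the core of the standard Angoff method can be sketched 
briefly. In the usual textbook form, each panelist estimates, for every 
item, the probability that a student who just meets the standard would 
answer correctly; the ratings are averaged across panelists for each 
item, and the averages are summed to give an expected number-correct 
score. The Python sketch below follows that general form only. It does 
not reproduce NAGB's modified procedure or the Alberta procedure, the 
function name angoff_cut_score is hypothetical, and the ratings are 
invented. 

```python
def angoff_cut_score(ratings):
    """Compute a cut score from Angoff-style item judgments.

    ratings[p][i] is panelist p's estimate of the probability that a
    student who just meets the standard answers item i correctly.
    The result is an expected number-correct score on the whole test.
    """
    if not ratings or not ratings[0]:
        raise ValueError("ratings must contain at least one panelist and one item")
    n_items = len(ratings[0])
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every panelist must rate every item")
    if any(not 0.0 <= value <= 1.0 for row in ratings for value in row):
        raise ValueError("ratings must be probabilities between 0 and 1")
    # Average the panelists' ratings item by item, then sum over items.
    item_means = [
        sum(row[i] for row in ratings) / len(ratings) for i in range(n_items)
    ]
    return sum(item_means)


# Invented ratings: two panelists, four items.
panel_ratings = [
    [0.5, 0.75, 0.25, 1.0],
    [0.5, 0.25, 0.75, 1.0],
]
cut_score = angoff_cut_score(panel_ratings)
```

The calculation addresses none of the three criteria listed above; in 
particular, the resulting score would still need to be checked against 
the actual performance of students who score at that level. 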



Concluding Observations 
on Overall Performance 
Standards 

The methods discussed above incorporate different sources of information 
about student performance and thus suggest different bases for 
interpreting the test scores selected as standards. Whatever the method or 
combination of methods for setting the standard (actual performance, 
criterion performance, or item judgment), it is important to confirm and to 
be able to provide evidence to validate or support any interpretations that 
are given. Whether the item judgment process, which is expensive, 
contributes sufficient interpretive information — beyond what can be 
obtained through the other methods — to be worth the cost is a question 
that merits further review. 



Feasibility of Overall 
Performance Standards: 
Current Studies 



Our review suggests that, in principle, each of the methods discussed 
could be applied to current or future NAEP tests, as long as the 
performance standards are expressed in terms of levels of mastery of the 
materials that make up the test as a whole. So far, only one method of 
standard-setting has been applied to NAEP. Experiments with additional 
methods, however, are in progress. Both ACT and the National Academy of 
Education (NAE) are currently comparing NAGB levels results to actual 
performance data. 6 NAE also proposes to have experts construct 
achievement level scores based on items independently judged to be 
appropriate to each achievement level. 



5 Teachers make the item judgments, guided by statements of expectations based on the curriculum 
and by their own experience of what students in the upper and lower 15 percent of the distribution can 
do. 

These efforts will provide valuable information about the feasibility and 
usefulness of alternative standard-setting methods, as well as revealing 
whether these methods yield results that are consistent with NAGB's. When 
their results are in, NAGB and NCES will be in a good position to compare the 
feasibility of and types of information produced by alternative methods of 
setting standards or by combinations of approaches. 



Setting Content-Based 
Performance 
Standards Through 
NAEP 



As shown in chapter 2, current NAEP tests and the NAEP scale were not 
designed to support content-based performance standards. (NAEP's 
purpose and design are summarized in appendix III.) However, existing 
tests might contain sufficient items to support some content-based 
performance standards, and future tests might be designed to match such 
standards. A general difficulty with the application of content standards to 
NAEP is that the United States has not established national standards 
describing what students should learn, nor is there a national curriculum. 
Thus, the usual prerequisites for the identification of content-based 
performance standards are not in place. 7 National content standards, 
however, are on the way. Standards are currently being developed for 
mathematics, science, history, and geography (subjects regularly assessed 
by NAEP) under conditions that are intended to produce the expert and 
citizen consensus essential to widespread adoption. 

We examined how existing NAEP tests might address content standards 
once these have been developed and adopted and how such standards 
might be accommodated within NAEP's design. Our analysis assumes that 
monitoring (that is, measuring current student performance and tracking 
changes in performance over time) will continue to be the major purpose 
of the assessment and that any changes designed to align NAEP to content 
standards will need to be compatible with that purpose. We further 
assume that NAEP will be part of a larger system of standards-related 
assessments of student performance at the classroom, school, district, 
state, and national levels. 8 



6 The NAE study is part of an evaluation of the NAGB levels under contract to NCES. 

7 We have analyzed experiences in the Canadian provinces with developing tests and standards in 
connection with a predetermined curriculum. See U.S. General Accounting Office, Educational 
Testing: The Canadian Experience With Standards, Examinations, and Assessments, GAO/PEMD-93-11 
(Washington, D.C.: April 1993). 



Applying Standards to 
Existing Tests 

Whether and how an existing NAEP test can be used to measure student 
performance in terms of a particular content standard will depend on 
whether the test contains a sufficient pool of items appropriate to that 
standard. NCES officials estimate that at least 20 items are needed to 
measure a skill or content area accurately, and the items need to be 
reasonably representative of the domain of content described in the 
standard. Where there is an adequate pool of items to sustain accurate 
measurement, a content-based performance standard might be set for 
these items through item judgments or any other appropriate technique. 
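
The item-count condition described above lends itself to a simple check. 
The Python sketch below is a minimal illustration: the 20-item threshold 
is the NCES estimate cited in the text, while the function name 
areas_with_enough_items, the content-area labels, and the item counts 
are hypothetical. 

```python
from collections import Counter

# Threshold taken from the NCES estimate cited in the text.
MIN_ITEMS = 20


def areas_with_enough_items(item_areas, min_items=MIN_ITEMS):
    """Report, for each content area, whether the item pool is large enough.

    item_areas lists the content area assigned to each test item,
    one entry per item.
    """
    counts = Counter(item_areas)
    return {area: count >= min_items for area, count in counts.items()}


# Invented classification of a 34-item pool (not an actual NAEP test).
pool = ["algebra"] * 22 + ["geometry"] * 12
coverage = areas_with_enough_items(pool)
```

A count of this kind addresses only the first condition. Whether the 
items are reasonably representative of the content domain remains a 
matter for expert review. 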

For example, suppose that national mathematics standards define "core" 
proficiency for 8th grade in content terms. Experts might be convened to 
identify the NAEP items pertinent to core proficiency and to evaluate 
whether the item pool was sufficiently large and representative of the 
required skills. If there proved to be enough items that they could be 
legitimately combined to form a separate subscale, item judgment 
procedures (or indeed any of the procedures discussed in connection with 
overall performance standards) might be used to suggest the score that 
represents adequate mastery of the material. Or NAEP might be used to set 
a standard of adequate performance in mathematics with respect to each 
component area covered by the test (see appendix III) using the subscales 
that were incorporated into the test by design. 

Where there are not sufficient items to support a content-based 
performance standard, NAEP can report the percentages of students who 
answered illustrative items correctly and can identify the overall NAEP 
scale score at which items illustrative of the standard are consistently 
mastered. These strategies do not provide a measure of achievement of a 
standard, but they provide useful interpretive information. 
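
One way to locate the scale score at which an illustrative item is 
consistently mastered is sketched below in Python. The sketch rests on 
assumptions that the text does not specify: "consistently mastered" is 
taken to mean that at least 80 percent of students answer correctly in a 
given 10-point score band and in every higher band. The function name 
mastery_score, the threshold, the band width, and the response data are 
all hypothetical. 

```python
from collections import defaultdict


def mastery_score(responses, threshold=0.8, band_width=10):
    """Find the lowest score band from which an item is consistently mastered.

    responses is a list of (scale_score, answered_correctly) pairs for a
    single item. Students are grouped into bands of band_width points.
    Returns the lower edge of the lowest band such that this band and
    all higher bands have a proportion correct of at least threshold,
    or None if even the highest band falls short.
    """
    bands = defaultdict(lambda: [0, 0])  # lower edge -> [correct, total]
    for score, correct in responses:
        edge = (score // band_width) * band_width
        bands[edge][0] += 1 if correct else 0
        bands[edge][1] += 1

    result = None
    # Walk downward from the highest band and stop at the first shortfall.
    for edge in sorted(bands, reverse=True):
        right, total = bands[edge]
        if right / total >= threshold:
            result = edge
        else:
            break
    return result


# Invented responses to one illustrative item (not NAEP data).
item_responses = (
    [(252, True), (254, False), (256, False), (258, False)]
    + [(261, True), (263, True), (265, False), (267, False)]
    + [(271, True), (272, True), (274, True), (276, True), (277, True), (278, False)]
    + [(281, True), (283, True), (285, True), (287, True), (289, True)]
)
mapped_score = mastery_score(item_responses)
```

A figure produced this way describes where on the overall scale an item 
tends to be answered correctly; as noted above, it is interpretive 
information and not a measure of achievement of a standard. 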

The chief advantage of taking NAEP's current design as "given" is that this 
strategy ensures that NAEP will continue to serve its statutory purpose of 
supporting fair and accurate description and trend analysis. The chief 
disadvantage is that some content standards (those not suited to 
measurement through NAEP) may be left unassessed at the national level. 
For example, a standard that every student should develop specialized 
knowledge of one out of six areas of applied mathematics would be 
difficult to measure through NAEP, because adding enough items on all six 
areas to ensure that each student tested can show knowledge of one such 
area would make the test unworkably long. 



8 NAGB solicited public comment in the fall of 1992 regarding the role of NAEP in a system of 
assessments. NAGB's discussion paper asks readers to consider whether national content and 
performance standards should determine the content of each assessment or whether content should 
continue to reflect (evolving) current curriculum as well as these standards. It also asks for views on 
the continued use of the achievement levels and whether other approaches to identifying appropriate 
achievement goals should be considered. 



Designing NAEP to Fit 
Content Standards 

We find that, in principle, NAEP could be designed to address national 
content standards but that technical considerations limit the range and 
type of content standards for which NAEP can provide performance 
measures. Within a design that is intended to describe overall performance 
accurately, based on a sample of students each of whom sees only a 
portion of the test questions, NAEP will be best able to address standards 
that concern commonly taught materials that most students are expected 
to master and least able to address standards that concern specialized 
materials or knowledge to which few students are currently exposed. 9 

The implication is that there are significant limits to what can be expected 
of NAEP in measuring student performance against content standards. 
Assessment of selected areas of content as well as different levels of 
performance in each area is practically impossible because there will 
never be room in the test for all the necessary questions. NAGB hoped to do 
both; the more realistic choice is one or the other. 

That is, NAEP could potentially be used to set a standard for general 
mastery of each of several broad skill areas (such as the component areas 
included in the 1990 mathematics exam: numbers and operations, 
measurement, data analysis, algebra and functions, and geometry) but not 
multiple levels of mastery within each area. Alternatively, NAEP could be 
designed to include items in various content areas at several levels of 
complexity and to measure the percentage of students who have mastered 
each level of complexity. By virtue of its design, NAEP cannot provide data 
pertinent to standards of either kind that apply only to small groups of 
students, such as those in specialized programs. Content mastery 
standards that refer to skills that cannot validly be assessed through a 
brief paper-and-pencil test (such as the skill of carrying out a long-term 
project independently) will also fall outside NAEP's scope. 



9 Students (and especially less-proficient students) tend not to answer questions about material that 
they have not had the opportunity to learn. The greater the proportion of unanswered questions, the 
less accurately NAEP can estimate student performance. A substantial increase in the proportion of 
items that represent unfamiliar material, especially if it is material to which only advantaged or very 
able students are exposed, may thus decrease NAEP's ability to describe the performance of average 
and below-average students accurately. 

In summary, designing NAEP to fit national standards of general student 
mastery of commonly covered content and establishing standards for 
student performance with respect to these content standards could 
improve the usefulness of the assessment. However, designing NAEP to 
support standards that refer to specialized or emerging content areas 
raises a host of technical and policy questions and is also clearly 
premature, given that no standards have yet been developed or adopted. 
Until more is known about the nature of the standards for different subject 
areas and how these will be addressed by other elements in the 
assessment system, it is difficult to envision the test design and analysis 
techniques that would be most appropriate or how NAEP could best 
contribute to the overall system. 

NAGB and NCES could usefully explore how NAEP might respond to different 
types of standards within and outside of the traditional sample and design. 
For example, NAEP might use the traditional sample to address only the 
standards that refer to the mainstream curriculum but create experimental 
modules for use in a subsample of schools or in states whose curricula 
include these practices and report the results separately. Discussion and 
pilot-testing of possibly useful techniques now could help NAEP prepare for 
the day when standards become available. 



We conclude that while NAGB's particular approach is not suitable for use 
with NAEP, alternative approaches are feasible. Approaches designed to set 
standards of overall performance seem more readily applicable to NAEP 
than content-based standards. Although they require consensus on 
relevant skills and skill levels, overall performance standards do not 
presuppose the existence of national content standards and do not require 
changes in the design of NAEP tests. Within the next few months, NAGB and 
NCES are likely to have data sufficient to evaluate whether any of the 
methods, or a combination, could result in standards that would be 
accepted as credible on both technical and policy grounds. However, 
overall performance standards cannot be used to assess whether students 
have mastered particular content at an acceptable level. 

Approaches designed to measure content mastery (other than a simple 
item benchmarking approach) would be more difficult to apply and could 
require changes in test design. Content mastery is a matter of considerable 
concern, and NAEP will undoubtedly be expected to monitor progress 
toward the achievement of national content standards. Therefore, we 
conclude that activities to explore how NAEP can be designed to support 
content-based standards without compromising its ability to serve its 
statutory purposes are also needed. 

Overall performance standards and content-based performance standards 
provide different information and represent different kinds of achievement 
goals. Our analysis suggests that it is important to clarify which of these 
kinds of achievement goals NAGB and NAEP should address. 



Recommendations 

In view of the conceptual and technical flaws inherent in NAGB's 
achievement levels approach (see chapter 2) and of the many questions 
that need to be resolved before an alternative standard-setting method can 
be selected, we recommend that NAGB withdraw its policy of applying the 
1990 achievement levels approach to future NAEP tests and join with NCES 
in exploring alternatives for setting both content-based and overall 
performance standards with respect to NAEP. This inquiry should examine 
issues of purpose, technical feasibility, cost, fairness, credibility, and 
usefulness. 

We recommend that the Congress specify what it intends in directing NAGB 
to identify appropriate achievement goals: whether it envisions the 
establishment of overall performance standards, the establishment of 
content-based performance standards, or simply better alignment of test 
coverage with content mastery standards. Given that legislation to 
establish a mechanism for adopting national content standards is currently 
under consideration, the Congress may also wish to express specific 
guidance with respect to activities to align NAEP to content standards 
before such a mechanism is in place. 



Our final study question concerns whether NAGB, whose members for the 
most part are not technical experts, has knowledge resources and 
procedures sufficient to ensure that work done at its direction will be 
technically sound. We examined NAGB's actions in planning and 
implementing the achievement levels approach with this question in mind. 
We also examined two other technical decision areas, so as not to base 
conclusions on what could have been an atypical example. 1 

Our evaluation of the achievement levels example takes account of the 
fact that NAGB's approach to this highly important task was new and 
unusual. In addition to making untested modifications to Angoff item 
judgment procedures, NAGB reversed the usual sequence of steps involved 
in testing student performance in terms of standards. NAGB started with a 
test and derived standards of what students at three levels should know 
from the test items, using the item judgment process. The usual sequence 
is to develop broad expert and citizen consensus on what students should 
know, determine whether one or more levels of achievement should be 
measured, and design a test or tests accordingly. As illustrated by 
Canada's approach to national standards and testing and by the work of 
the National Board for Professional Teaching Standards in the United 
States, preparing for and developing standards-based testing takes much 
time and requires extensive technical support. 2 We looked for evidence 
that NAGB considered the feasibility of its quite different approach carefully 
at the planning stage. Bearing in mind that difficulties with a new 
procedure cannot always be foreseen, we looked especially for evidence 
that NAGB reviewed the results of its 1990 approach carefully before 
concluding that it was sound. 

In all three of the examples we reviewed, we looked for evidence that the 
technical feasibility and implications of the issue under discussion were 
examined and alternative approaches were evaluated, where feasible and 
appropriate, before action was taken; that technical procedures were 
adequately planned and supported; that expert advice was sought at key 
points during policy planning and implementation; that issues raised by 
such experts were addressed and resolved; and that the products of 
technical procedures were reviewed according to appropriate quality 
criteria before they were accepted. 



1 We did not review NAGB's handling of other kinds of decisions, such as decisions regarding the
subject matter to be covered in each assessment. Our findings with respect to technical issues do not
reflect on NAGB's performance in other areas.

2 On the Canadian approach, see U.S. General Accounting Office, Educational Testing: The Canadian
Experience With Standards, Examinations, and Assessments, GAO/PEMD-93-11 (Washington, D.C.:
April 1993).

NAGB has indicated that it considered the identification of achievement
goals to be a matter of informed judgment rather than a technical issue,
and it believes that it is not appropriate to apply criteria of technical
quality to the setting of achievement levels standards. We agree that
standard-setting is a matter of informed judgment. However, when the
selection of test scores to serve as standards rests wholly or largely upon
technical procedures and data, as it did in the achievement levels case, it is
not only appropriate but also essential to inquire whether those
procedures and data were technically sound and based on adequate
information. Whatever the method by which scores were selected, it is
important to verify that the interpretations given to those scores are
supported by evidence.



The Achievement Levels Setting Case

We presented a descriptive summary of the background, development, and
implementation of NAGB's approach to setting performance standards in
the form of achievement levels in chapter 1. For the present chapter, we
analyze NAGB's use of technical information at each key point in these
events.



Planning and Design

Our analysis suggests that NAGB adopted and modified the Angoff item
judgment method as the basis for its approach without sufficiently
examining its requirements and limitations and, indeed, may have
misunderstood what this method can produce. The Angoff method was
first recommended to NAGB in a December 1989 staff paper, which
proposed that Angoff judgments be used to set a standard of performance
on items representing core knowledge for each grade, a purpose for which
this method is well suited. However, the paper stated that the Angoff
method is used to identify "core" items through panelists' yes-no
judgments and that the method would identify NAEP scores that reflect
expected performance on these core items only. As we have indicated in
earlier chapters of this report, the Angoff method used with the NAEP scale
does not set a standard for performance on core items: it sets a standard
for performance on the test as a whole. The statement in the NAGB paper is
not only incorrect but seriously misleading.
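
The distinction drawn above can be illustrated with a small calculation. The
following is a minimal sketch of the conventional Angoff computation, not a
reconstruction of NAGB's modified procedure: each judge estimates the
probability that a borderline student answers each item correctly, and the
estimates are averaged over items and judges. The ratings, the number of
judges, and the function name are all invented for illustration, and the
further step of mapping the proportion onto the NAEP scale is omitted.

```python
# Hypothetical Angoff ratings: ratings[j][i] is judge j's estimate of the
# probability that a borderline student answers item i correctly.
# All numbers are invented for illustration only.
ratings = [
    [0.9, 0.6, 0.4, 0.2],
    [0.8, 0.7, 0.5, 0.3],
    [1.0, 0.5, 0.3, 0.2],
]


def angoff_cut_score(ratings):
    """Return the expected proportion correct on the whole test."""
    n_items = len(ratings[0])
    # Each judge's expected proportion correct across all items.
    per_judge = [sum(judge) / n_items for judge in ratings]
    # The panel's cut score is the mean over judges.
    return sum(per_judge) / len(per_judge)


cut = angoff_cut_score(ratings)

# Reversing every judge's ratings changes which items are rated "easy",
# yet the resulting cut score is the same.
reversed_ratings = [list(reversed(judge)) for judge in ratings]
cut_reversed = angoff_cut_score(reversed_ratings)

print(f"Cut score (proportion correct): {cut:.3f}")
print(f"Cut score with item ratings reversed: {cut_reversed:.3f}")
```

With these invented ratings the cut score is roughly 0.53 in both cases. The
result is a single proportion for the whole test; it carries no information
about which particular items a student must answer correctly.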

NAGB committees subsequently recommended the use of Angoff
procedures based on the understanding that this is the most practical of
the item judgment methods and can be applied to an existing test.
However, our search of staff papers, committee records, and transcripts of
NAGB meetings revealed no detailed examination of the limitations and
requirements of the Angoff method and provided evidence that existing
alternative methods received only a cursory review. 3

We also found little evidence that NAGB examined the technical
implications of its policy decision to set multiple standards using the
entire NAEP test as the item base or that it understood that the resulting
scores would not necessarily represent mastery of "core" items for each
level. 4 From what we can determine, NAGB simply assumed that having
diverse panelists judge every test item with respect to all three
achievement levels, guided by generic definitions of the three levels,
would produce reliable results, although the technical literature suggests
otherwise.

Although the December 1989 concept paper and the May 1990 policy paper
were circulated for comment, NAGB did not formally develop a technical
design for the initial item judgment process and did not obtain a technical
critique of the overall design before going ahead. (NAGB did get reviews of
the materials developed to orient judges but not of the overall plan.
Moreover, the schedule did not allow for procedures to be pilot-tested.)
When the initial design proved problematic, NAGB's staff designed revised
procedures. The redesign was only sketchily developed and reviewed
before it was implemented, and it introduced changes that further
departed from standard Angoff practice and made it difficult to evaluate
the reliability of the resulting data.



3 NAGB considered using traditional NAEP anchor points or the scores for the 25th, 50th, and 75th
percentile of students as benchmarks. We found no discussion of other methods of setting
performance standards such as those we presented in chapter 3.

4 NAGB's May 1990 achievement levels policy paper states that the item judgment panels would use "a
proven judgment procedure to recommend which test questions and/or which proportion of questions
students need to answer correctly to reach different achievement levels" (National Assessment
Governing Board, The Levels of Mathematics Achievement, vol. 3, Technical Report (Washington, D.C.:
1991), p. 345). In fact, the procedure identifies only the proportion of questions to be answered
correctly. Transcripts of the March and May meetings make clear that some NAGB members did not
understand the Angoff method or thought that they did not have enough information about it.

Use of Expert Advice: 1990

We found that NAGB did solicit and receive sound technical advice. Most of
the methodological and procedural issues that proved problematic were
first raised during the period in which the policy was under consideration,
many of them by NAGB members and staff. Much good advice regarding
improvements to the item judgment procedures was followed, in 1990 and
for 1992. However, NAGB set aside experts' early advice to proceed
cautiously and examine alternative methods more fully before selecting a
standard-setting approach. Perhaps more important, NAGB did not respond
to fundamental questions about the NAGB approach that emerged from the
evaluation of its initial results. Although NAGB initially emphasized that its
approach was provisional, it has not opened it to reconsideration despite
recommendations from several quarters that it do so. We find that NAGB
approved the 1990 achievement levels results and extended its
commitment to the levels approach without adequate evidence that its
procedures and results were technically sound and led to valid
interpretations of NAEP scores. (Our criteria for evidence of technical
soundness and validity were presented in chapter 2 of this report.) NAGB's
May 1990 paper noted that the 1990 levels-setting was a trial but stated
NAGB's expectations that its results would be usable for 1992 and that the
levels would be the primary basis for reporting in 1992 and beyond. In
November 1990, NAGB was informed of technical difficulties with the levels
approach but nonetheless adopted a publication policy that stated that the
achievement levels should be the primary basis for reporting NAEP results
and took action to incorporate achievement levels into the request for
proposals for 1994-96 NAEP operations.

NAGB approved the 1990 results for publication and solicited the contract
proposals for the application of its levels approach to the 1992 NAEP
assessments in mathematics, reading, and writing before the reliability of
the levels results had been fully evaluated, in the absence of any
confirmation of the validity of the interpretations given for them, and
against the advice of its evaluation team. It instructed NCES to set up the
1992 NAEP results in terms of the achievement levels and reaffirmed these
instructions as recently as August 1992, although preliminary review of the
mathematics results for 1992 (including the statements of what students at
each level should be able to do) had indicated that the approach was still
possibly problematic.

Our analysis of the record suggests that several factors contributed to
NAGB's decisions to move forward with the levels approach despite the
questions that had been raised about it. As already noted, NAGB considered
the selection of test score standards to be a matter of policy judgment and
did not recognize the degree to which validity of interpretation would be
an issue. NAGB recognized that score selection should be based on a
defensible procedure and had been advised that although questions about
it remained, the 1990 item judgment process met this criterion. The
procedures used to construct the achievement level descriptions and
illustrative items, which were based in part on standard NAEP methods,
appeared defensible as well. The benefits of sending an important message
about U.S. students' school achievement appeared considerable, and NAGB
saw little risk in publishing scores and interpretations that had yet to be
fully examined. NAGB's idea was that these results could be validated and
adjusted later if necessary. 5



Changes for 1992

NAGB recognized that it had underestimated the technical resources needed
to implement the item judgment procedures and therefore greatly
increased the resources available for 1992. Rather than being done
in-house, the 1992 item judgment procedures were conducted through a
$1.34 million contract to ACT, a testing firm with extensive experience in
standard-setting and an expert staff, aided by advice from external
experts. The budget and timelines provided for analysis of the reliability of
the item judgment results. However, the budget was sufficient to fund only
relatively small judgment panels, which may limit the reliability of the
results. A more significant limitation is that by specifying that the 1992
contractor should implement its 1990 approach, NAGB probably
discouraged bidders from proposing a technically stronger design.
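
The point about panel size can be made concrete with a rough calculation. The
sketch below assumes, purely for illustration, that judges' cut scores are
independent and vary around the panel mean with a standard deviation of 0.08
(an invented figure, not one taken from the 1992 data). Under that assumption
the standard error of the panel's mean cut score shrinks only with the square
root of the number of judges, so small panels leave more uncertainty in the
selected score.

```python
import math


def standard_error(judge_sd, n_judges):
    """Standard error of the mean cut score for a panel of independent judges."""
    return judge_sd / math.sqrt(n_judges)


# Hypothetical spread of individual judges' cut scores (proportion correct).
judge_sd = 0.08

# Compare panels of different (invented) sizes.
for n_judges in (10, 20, 40):
    se = standard_error(judge_sd, n_judges)
    print(f"{n_judges:>2} judges: standard error = {se:.4f}")
```

Under these assumptions, quadrupling the panel halves the standard error; real
panels may depart from independence, in which case the gain would be smaller.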

Most significant of all, the contract for the 1992 standard-setting did not
provide for validity studies to be undertaken. NAGB has now remedied this
omission and has entered into a separate contract with ACT for this
purpose.

Decisions concerning publication of the 1992 NAEP mathematics results
had to be made before these studies could be completed. The NCES
prepublication review did not address the levels scores or the procedure
by which they had been reached: these were taken as given. Rather, the
review focused on the issue of valid inference, that is, on whether NAGB's
statements and illustrations of what students at each level should be able
to do were a valid basis for understanding what students at those levels
actually can do. Questions arose about the achievement levels and about
the traditional anchor points as well. The report summarized these
questions, indicated that studies to resolve them were under way,
presented test results in terms of both methods, and urged readers to
compare them and to join in the debate on how to reflect standards
through NAEP reporting.



5 In fact, new levels scores for 1990 were calculated as a result of the 1992 standard-setting process.
The descriptions of the skills and competencies associated with each level were also revised.

Conclusion and Implications

We conclude that in the case of the achievement levels, NAGB undertook a
technically complex function that it lacked the specialized knowledge to
direct. Unfortunately, the technical nature of this function was not evident
at first. NAGB viewed the selection of achievement goals as a question of
social judgment that NAGB, by virtue of its broad membership base, was
well suited to decide. The Angoff method and the interpretation of the
scores selected through that method did not appear to pose technical
problems, as indeed they might not have if NAGB had followed its original
plan to set a single standard for each grade. The decision to set three
standards per grade (although it was well supported on policy grounds)
created a host of technical complications that NAGB neither recognized nor
addressed. The result of this chain of events is that NAGB has given policy
direction to NCES to take actions concerning NAEP that are technically
unsound.

The implications of the achievement levels example are significant. NAGB's
authorizing statute assigns it several functions that are explicitly technical
in nature, such as developing test specifications and the methodology of
the assessment. Other functions, such as the development of assessment
objectives, are not explicitly technical but must be performed with an
awareness of NAEP's technical capabilities and limitations and of the
requirements for accurate estimation and trend reporting. The example of
the achievement levels raises concerns that, in the absence of sufficient
knowledge on technical issues, NAGB could give directions that might
undermine NAEP's technical integrity and render it unable to serve its
statutory purposes.

However, the achievement levels example may be atypical. Adopting the
levels policy was one of NAGB's earliest major actions, undertaken at a time
when working relations between NAGB and NCES had not been fully
established. Setting performance standards was new for NAEP, so there was
little direct experience to guide NAGB's deliberations. Users such as the
National Education Goals Panel were eager to see NAEP results reported in
terms of standards, for use in the panel's first report on the achievement of
national goals. The technical deficiencies we observed in the process
might simply reflect these specific circumstances.



Other NAGB Actions in Technical Areas

To see whether NAGB's actions in the levels case were typical of its
technical decisions generally, we reviewed two additional cases. In one
case, NAGB undertook to identify the conditions under which states or
other testing entities would be permitted to link their own assessments to
NAEP and to set up procedures for evaluating and responding to proposals
of this nature. Requests from state testing directors prompted action in
this case. The issue was clearly technical, and NAGB recognized this from
the first. It arranged for experts to review the policy options and drafted
general guidelines for handling requests to link other assessments to NAEP,
in which technical review was delegated to a NAGB committee and to
NCES. After hearing concerns from the Council of Chief State School
Officers, NAGB arranged for further technical and constituency review so
that the policy could be refined to fit varied circumstances.

The issue in the other case was whether to continue to report NAEP results
in terms of a single scale that covers all three grades tested (cross-grade
scaling) or to report each grade in terms of its own scale (within-grade
scaling). This issue raises a question of policy: should NAEP be designed to
provide detailed information about variations in performance within a
given grade, or should it sacrifice some within-grade detail in order to
show increases in learning from one tested grade to the next? To answer
this question, however, requires an understanding of the kinds of
information that various types of scaling can provide. It also requires
attention to the cumulative or grade-specific nature of the subject areas
tested. Thus, sound policy choice depends on an understanding of
technical issues.

In this case, NAGB initially focused on the policy question and on the
advantages each type of scale has in principle. Responses from NCES and
ETS, among others, suggested that a specific review of the current
cross-grade scaling and of the effects of changing to within-grade scaling
for NAEP should be undertaken before a decision was made. NAGB
subsequently obtained such a review through NCES's technical review
panel. This review supported NAGB's initial preference for within-grade
scaling but also found that NAEP's cross-grade scaling is well implemented.
A member of NAGB who is also a curriculum expert argued strongly that
topics cannot be neatly divided by grade level in some subject areas and
that options should be left open. NAGB adopted a policy that calls for the
use of within-grade scaling where feasible and appropriate. Each subject
area consensus group will be asked to determine which type of scale best
fits its needs. NCES has noted that these groups, like NAGB itself, contain
few members whose training would allow them to grasp the technical
implications of this question.

Both of these cases proceeded differently from the achievement levels
effort. NAGB recognized that each case involved technical issues on which
it should seek expert advice. NAGB arranged for technical information to be
analyzed and for issues and alternatives to be fully examined by technical
experts. In both cases, NAGB postponed action on a draft policy when an
important constituency asked for further review. In both, NAGB abandoned
an initial "one size fits all" approach for a policy that could be adapted to
varied circumstances. Finally, NAGB's actions and policy decisions in both
cases made use of the technical resources available through NCES and were
consistent with the expert advice that NAGB received.

These three cases indicate that, as might be expected of a body designed to
represent a broad spectrum of constituency opinion, NAGB has tended to
focus initially on the policy dimensions of the issues that have come
before it. Its handling of the technical dimensions has varied. In one case,
NAGB immediately recognized the technical aspect of the policy issue
(linking state tests to NAEP), sought expert review of the options, adopted
initial decision guidelines, and delegated implementation to technical
experts in NCES. In the second, NAGB recognized immediately that
within-grade scaling was a technical issue as well as a policy issue,
provided for in-depth technical analysis after learning of NCES and ETS
concerns, but delegated decisions to groups that include only limited
expertise in scaling. In the third case, NAGB conceived of standard-setting
as a policy function that it should itself perform, did not adequately
examine whether the approach it designed was technically sound, and set
aside technical experts' concerns.

We conclude that NAGB has access to significant technical resources
through NCES and through its own ability to contract for expert services.
When NAGB recognizes the technical dimension of a policy area, it can use
these resources appropriately to inform policy planning and to implement
policy guidelines. However, NAGB may not recognize that a policy has
important technical dimensions or may subordinate technical quality to
policy requirements. In such cases, NAGB could unknowingly (or
unintentionally) give policy direction to NAEP that is not technically sound.



Contributing Factors

In search of explanations for the problems evident in the achievement
levels case and for strategies that might ensure that NAGB follow the useful
practices observed in the two other cases, we examined three factors: the
structure, responsibilities, and capabilities of NAGB and of NCES; NAGB
membership; and NAGB operating procedures.



Structure, Responsibilities, and Capabilities

We examined NAEP's governance structure to determine the protection it
offers against the receipt of technically unsound policy direction. We also
considered whether such protection as does exist is likely to be effective
and looked for structural and procedural features that foster policy
direction that is both technically sound and responsive.

By statute, NAEP is governed through a structure of two units, each with a
unique strength. (Appendix IV summarizes the statutory structure.)

One unit, NAGB, is designed to be broadly representative. NAGB is a lay body
composed of members of key constituencies (state and local officials and
educators, citizens, and two experts in measurement) who meet several
times a year with committee activities between meetings; it is assisted by a
staff of six professionals. NAGB is independent of the Department of
Education. It has a general responsibility to formulate policy guidelines for
NAEP and to advise the Commissioner of Education Statistics (who heads
NCES) regarding the conduct of the assessment. The Secretary of Education
(through the commissioner) reports to NAGB regarding actions to
implement NAGB's decisions.

NAGB is responsible for carrying out specific functions or responsibilities,
which it may delegate to its staff. These include selecting the subject areas
to be addressed; identifying appropriate achievement goals; developing
assessment objectives; developing test specifications; designing the
methodology of the assessment; developing guidelines and standards for
analysis, reporting, and dissemination; developing standards and
procedures for interstate, regional, and national comparisons; and taking
action to improve the form and use of the assessment. NAGB has the final
authority on the appropriateness of cognitive items, is directed to ensure
that items are free from bias, and directs the consensus process through
which test objectives (the content areas to be covered on each test) are
established.

The other unit, NCES, is characterized by technical expertise. NCES is staffed
by full-time technical experts and has access to others through its advisory
committees. It has access to additional experts through ETS as the
technical contractor for NAEP and through the technical advisory body that
ETS is required to consult. NCES is responsible for administering NAEP and
for ensuring that the assessment is fair and accurate. NCES is also
responsible for conducting review and validation studies of NAEP and has
established a technical review panel for this purpose.

The Commissioner of Education Statistics, who heads NCES, is the
guardian of the quality of statistical data produced under his or her
supervision, including NAEP data. NCES reviews all statistical data prior to
their publication, using standards established in consultation with the
Associate Commissioner for Statistical Standards and Methodology (by
law, a highly trained expert) and the Advisory Council on Education
Statistics.

The potential for conflict in this structure, including the possibility that
NAGB might give NCES policy direction that the commissioner, as guardian
of technical quality, could not implement, has long been evident. 6
Designed to ensure that policy guidance for NAEP is free of inappropriate
influences, the structure gives NAGB responsibility for many functions that
are highly technical but does little to ensure that NAGB's judgments are
technically well informed. The structure makes NCES responsible for NAEP's
technical quality, but NCES's primary technical quality control mechanism,
the prepublication review process, comes into play only after a policy has
been implemented and produced results. Moreover, the NCES review
process does not apply to reports published under NAGB's independent
authority. (As already noted, this was the path followed for NAGB's 1990
achievement levels publications.)



NAGB Membership

Our review suggests that NAGB needs to have enough technically trained
individuals among its members to ensure that the technical implications of
policy issues are recognized early and are given appropriate attention
throughout the policy planning process. Its formal structure provides for
two members who are experts in testing and measurement. During the
period covered by our review, however, only one of these two had strong
technical training; the other person's expertise lay in assessment policy
and implementation.

We conclude that it is not sufficient for NAGB to have only one member
with strong technical training. A single individual's efforts may be spread
too thin; his or her absence from a particular meeting will leave an
important perspective unrepresented; and a single spokesperson for an
unfamiliar view is unlikely to prevail in a group discussion. The two
positions specified in the law are a bare minimum for adequate
decisionmaking and should probably be increased.

Operating Procedures

We located two sources of procedural protection against technically
unsound policy direction: the memorandum of understanding between
NAGB and the Department of Education signed in April 1992 and NAGB's
operating policies.

The Memorandum of Understanding

The memorandum of understanding was negotiated in an attempt to
resolve the conflicts and ambiguities in NAEP's governance structure. The
memorandum commits the department to "make every reasonable attempt
to implement the policy-setting actions" of NAGB and specifies that when
such actions cannot be accomplished, the department and NAGB will seek
mutually satisfactory resolutions. This specification suggests that if NAGB's
direction conflicts with the commissioner's statutory quality-assurance
responsibilities, the commissioner may legitimately inform NAGB that he or
she cannot follow its instructions and may refuse to accept as satisfactory
a solution that is not technically sound.

We found that while this memorandum appears to give the commissioner
the right to act in accordance with his or her judgment and statutory
responsibilities if given technically unsound direction, it does not fully
resolve the issue. Judging from the continued negotiation over whether
1992 NAEP results should or should not be reported in terms of
achievement levels, it is not yet clear what constitutes a "reasonable
attempt" or what will happen if a mutually satisfactory solution cannot be
found. We were assured by NAGB's executive director and by its members
that NAGB can envision circumstances under which the commissioner
could legitimately be unable to follow NAGB's directions. NCES officials,
however, appeared unconvinced that NAGB recognizes the commissioner's
independent responsibilities.

NAGB Policies

Among the NAGB policies pertinent to the technical quality issue, we found
that the "Policy on Policies" (December 1989) states that NAGB policies that
address its legislated responsibilities will state the end to be achieved but
not the means of policy implementation: the means will be left to the
implementor. However, this restriction was not observed in the levels
example: NAGB's levels policy specified both ends and means. In
commenting on this observation, NAGB explained that the levels policy
covers both ends and means because the identification of achievement
goals is a function specifically assigned to NAGB by statute, and NAGB itself
was the implementor.

The functions assigned by statute, however, are the very functions to
which this policy is directed. In light of NAGB's comment, it seems
important to clarify whether NAGB is responsible only for policy with
respect to these functions (with the expectation that implementation will
be delegated to NCES or to a qualified contractor) or whether it is
responsible for implementation as well and can choose not to delegate.

The "Policy on Policies" also requires public or expert involvement or both
prior to final NAGB action and permits (but does not require) NCES, the NAEP
contractor, and NAEP administrators to make suggestions concerning
policy alternatives. In the levels case, we found that NAGB offered such
limited information so little in advance of public hearings that technical
experts found it difficult to offer useful comments. In our view, mere
permission for NCES and NAEP experts to suggest alternatives is a weak use
of a major technical resource.

NAGB's "Policy on Reporting and Dissemination" (November 1990) requires
all NAEP reports to follow NCES review and clearance procedures and to be
free from interpretations that are not supported by data; it also specifies
the use of the levels framework as the primary means of reporting from
1992 on. 7 This policy also gives NAGB the power to issue companion reports
that may be more speculative and interpretive than NAEP reports but must
be clearly distinguished from them. NAGB has not established quality
standards for its own reports.

We find that these policies and associated procedures offer only weak
protection against technically unsound direction and reporting, especially
since NAGB is apparently not compelled to delegate policy implementation
to knowledgeable experts.



7 NAGB officials assured us that the levels will be used only if they pass NCES review; however, the
policy does not state this explicitly.

Conclusion

We conclude that NAGB's knowledge resources and procedures do not
provide reasonable assurance that work done at its direction will be
technically sound. NAGB's governance structure, membership, and
procedures offer little protection against technically unsound policy
direction for NAEP and do little to encourage strong technical input
throughout policy formation. NAGB is capable of using the technical
resources that are available to it but may fail to see the need or may
choose not to do so.

NAGB is independent and works at its own direction. If it mandates a
technically uninformed course of action with respect to a function for
which NCES is responsible, the only recourse is for the commissioner to
refuse to carry out NAGB's instructions or, in the case of data reporting, to
follow NAGB's instructions knowing that they are likely to be overturned as
a result of the prepublication review. This arrangement is wasteful in that
it allows errors to be detected only after time and resources have been
expended on flawed work, and it puts the commissioner in a difficult
position.

Considering all the evidence, we conclude that the risk remains high that
NAGB may fail to recognize, even after advice from technically
knowledgeable experts, that a policy issue has critically important
technical implications and, thus, may give unsound technical direction to
NAEP.



Recommendations

We recommend a number of steps that NAGB should take to ensure that
technical aspects of proposed policies receive early and expert attention
and that the technical quality of all publications is maintained. These steps
can be taken within the existing structure and do not require any change in
legislation. We also recommend that the Congress review the division of
functions between NAGB and NCES, with a view to aligning those functions
more closely with organizational strengths and capabilities.



Recommendations to NAGB 

To ensure that it does not formulate and adopt technically unsound 
policies or approve technically flawed results, we recommend that NAGB 

1. obtain NCES review of the technical strengths and weaknesses of 
proposed policies that implement NAGB's statutory responsibilities, prior to 
final decision on such policies;8 

2. analyze the probable effect of proposed policies (such as the 
achievement level policy) on NAEP's ability to present achievement fairly 
and accurately and to support trend reporting that is both valid and 
reliable; 

3. pilot test and thoroughly evaluate any new design or analysis procedure 
before it is fully implemented and results are reported; and 

4. adopt standards of technical quality (to be applied internally) for 
publications issued under its own authority and also secure competent 
external technical review of such publications prior to authorizing their 
release. 



8 When a policy has implications for test design, technical experts should be involved as early as 
possible in the policy planning. Our recommendation is intended not to preclude NAGB from 
commissioning technical reviews from independent experts but simply to ensure that the experts who 
know NAEP are fully consulted. 

We recommend that the Chairman review actions NAGB has taken with 
respect to its statutory responsibilities in the past 2 years, identify those 
whose technical consequences have not been sufficiently examined, and 
secure technical review as necessary to ensure that these actions will not 
generate unanticipated technical difficulties in the future. 

We also recommend that the Chairman of NAGB review each proposed 
policy to ensure that NAGB prescribes policy ends, not technical details of 
implementation. 

With respect to NAGB membership, we recommend that NAGB nominate for 
the two testing and measurement positions only persons with relevant 
professional qualifications who are trained and experienced in the design 
and analysis of large-scale educational tests. To further add technical 
expertise within its currently mandated membership structure, NAGB 
should also ensure that two or more of its elected officials, educators, and 
representatives of the general public have significant technical knowledge 
and experience. 



Recommendations to the Congress 

We recommend that the Congress clarify the division of responsibilities 
between NAGB and NCES, with a view toward concentrating NAGB's efforts 
on the functions for which its broad representation is an asset and toward 
distinguishing functions NAGB itself is to implement from matters on which 
it is to give policy direction or advice to the commissioner. While NAGB as it 
is currently constituted can appropriately advise the commissioner from a 
constituency perspective regarding functions that are technical (such as 
the method and design of the assessment), it does not have the technical 
resources to carry out these functions and should be relieved of this 
apparent responsibility. When the Congress has more clearly determined 
what NAGB's functions should be, it should review NAGB's membership and 
determine the number of technically trained members needed. 






Appendix I 



Comments From the U.S. Department of 
Education 



Note: GAO comments 
supplementing those in the 
report text appear at the 
end of this appendix. 




UNITED STATES DEPARTMENT OF EDUCATION 



MAR 26 1993 



MAR 25 1993 



Ms. Eleanor Chelimsky 
Assistant Comptroller General 
General Accounting Office 
Washington, D.C. 20548 

Dear Ms. Chelimsky: 

The Secretary has asked that we respond to the draft report titled, Educational 
Achievement Standards: NAGB's Approach Yields Misleading Results, which was 
transmitted on February 22, 1993. Your report is timely because the current law 
expires this year and the Administration and Congress will be considering appropriate 
legislative changes for the National Assessment of Educational Progress (NAEP). 
Many of the findings, comments and recommendations are directed to actions of the 
National Assessment Governing Board (NAGB). The Board is responding separately 
to your invitation for comments and is including its views on, among other things, 
changes in its achievement level setting process for 1992 that were not detailed in your 
report. These changes are also a significant part of the overall effort that the Board 
has been conducting to fulfill a provision of the law that directs them to identify 
"appropriate achievement goals" for each age and grade tested in NAEP. Our 
comments are addressed to those issues related to the Department of Education and its 
National Center for Education Statistics (NCES), as well as to more general issues 
about standard setting and statistics. 

While the General Accounting Office (GAO) report deals primarily with technical 
aspects of NAGB's actions to set achievement levels, in fact, the concept of 
performance standards involves much more than that. Any attempt to establish 
performance standards raises questions of substance: what it is that we want 
American students to know and be able to do, and how well we expect them to do it. 
Performance standards also raise questions of public policy: whether our national 
assessment should lead, or should follow, student learning progress, and how we 
decide, as a nation, what the standards should be. The Governing Board is attempting 
to set performance levels to challenge American students. The National Education 
Goals and the Administration's legislative proposal, Goals 2000: Educate America 
Act, support that position. On the other hand, the Governing Board also supports 
gathering high quality data on trends in student performance. The task of balancing 
these two purposes is challenging, but necessary. The GAO report, however, appears 
only to support the limited trend monitoring role for the National Assessment. We 
believe that each of these roles, properly executed, can serve a constructive purpose in 
informing the public. 

The National Center for Education Statistics is supporting several studies about 
standard setting for NAEP, and specifically about the achievement levels adopted by 
NAGB, through the National Academy of Education. When completed, these studies 
will help to inform the national debate about standards-based education. But the 
policy and substantive issues, not just the technical ones, will be critical in 
deliberations of Congress as it deals with the Administration's plan for Goals 2000. 

1992 DATA RELEASE 

The National Assessment of Educational Progress (NAEP) is a complex project 
conducted by NCES with policy guidance from NAGB. Among other things, NCES 
has the responsibility to ensure that: (1) all reports and releases of NAEP data meet 
accepted standards of technical soundness, and (2) the data are released in a timely 
manner. As you are aware, the Center has awarded a grant to the National Academy 
of Education to evaluate the Trial State Assessment, including achievement levels. It 
has also contracted with University of California at Los Angeles (UCLA) to conduct 
studies of many features of NAEP, including achievement levels. These studies and 
our decisions about appropriate follow up action are ongoing. At the same time, 
States have invested significant resources in NAEP and many are depending on timely 
availability of data for critical policy decisions. In the meantime, however, NCES has 
tried to balance its responsibilities for sound and timely reporting of the 1992 NAEP 
data with NAGB's policy directive that achievement levels be the first and primary 
method of reporting the 1992 results. 

NCES and its contractor, Educational Testing Service (ETS), have prepared a report 
of the 1992 mathematics assessment in the Nation and the States using the 
achievement levels as one of the primary means of reporting the data. But to ensure 
high technical standards appropriate for a Federal statistical agency, the report also 
includes detailed information about means and distributions of student performance 
and also about anchor points (descriptions of what students know and can do at 
various points on the NAEP scale), similar to reports issued since 1985. 

The data will be released at a press conference on April 8, 1993, in a report titled, 
1992 Mathematics Report Card for the Nation and the States. The report will contain 
appropriate caveats about the achievement levels based on a number of studies, some 
of which are still in process. NCES will also release a report titled, Interpreting 
NAEP Scales, which will contain a full discussion of different methods of reporting 
and interpreting NAEP data. 

Since January, NCES has taken a number of steps to invite advice about what are 
appropriate inferences that can be made from the achievement levels. NCES has 
taken the issue to NAEP's Design and Analysis Committee, to the committee that 
advises NAGB's achievement level process, and to a March 3 meeting with many 
experts and interested parties to review studies on deriving appropriate inferences 
about student performance with respect to the achievement levels conducted by 
NAGB's contractor and the Center's Technical Review Panel. At this point, it is still 
not clear what inferences can confidently and accurately be drawn about student 
performance at the achievement levels. Furthermore, it appears that the development 
of the descriptions and the levels themselves are still in process and may be different 
in 1994 from what they were in 1992. 

NCES will continue to seek technical advice about the achievement levels and, in 
subsequent releases of the 1992 reading and writing assessment data, may issue 
separate "research and development" reports with the achievement levels as the means 
of reporting and other data reports without achievement levels. 

RESPONSE TO RECOMMENDATIONS 

Recommendation: (p. 2-38) In light of the many problems we found with NAGB's 
approach, we recommend that NAGB withdraw its direction to NCES that the 1992 
NAEP results be published primarily in terms of levels. The conventional reporting 
format should be retained until an alternative has been shown to be sound. Second, 
we recommend that NAGB and NCES develop a joint plan and schedule for review of 
NAGB's achievement levels approach, taking into account evaluations that are 
currently under way and providing for additional activities as needed. 

We are continuing external evaluations of achievement levels. NCES views 
development of alternative approaches as an iterative process in which there should be 
an opportunity for the public to know what is at issue and participate in the 
conversation on an informed basis. In the meantime, we are continuing our 
conversations with NAGB about appropriate reporting in future NAEP reports and 
Center reports continue to provide information about means and distributions of 
student performance and anchor points. 

Recommendation: (pp. 3-21, 22) In view of the conceptual and technical flaws 
inherent in NAGB's achievement levels approach (see chapter 2) and of the many 
questions that need to be resolved before an alternative standard-setting method can be 
selected, we recommend that NAGB withdraw its policy of applying the 1990 
achievement levels approach to future NAEP tests and join with NCES in exploring 
alternatives for setting both content-based and overall performance standards with 
respect to NAEP. This inquiry should examine issues of purpose, technical feasibility, 
cost, fairness, credibility and usefulness. 

We agree that a collaborative effort to explore alternative ways of setting standards 
would be a constructive activity. It would also be timely since the national curriculum 
standards projects are, in various ways, exploring the same question. Eventually, this 
may necessitate design changes in the administration and scaling of NAEP. The 
approach used in the future needs to take into account both the content and 
performance information needs implied by standards such as those developed by the 
National Council for Teachers of Mathematics and those to come in other subject 
areas, the information needs of the National Education Goals Panel, the need to 
measure trends over time, and the need to monitor the performance of the Nation and 
the States in comparison with National Education Goals. The approach used in the 
future should be beyond technical reproach, and provide a way to link standards with 
test-based data about what students "can do" at each achievement level. 

Recommendation: (p. 3-22) We recommend that the Congress specify what it intends 
in directing NAGB to identify appropriate achievement goals: whether it envisions the 
establishment of overall performance standards, the establishment of content-based 
performance standards, or simply better alignment of test coverage with content 
mastery standards. Given that legislation to establish a mechanism for adopting 
national content standards is currently under consideration, the Congress may also 
wish to express specific guidance with respect to activities to align NAEP to content 
standards before these mechanisms are in place. 

In general, we believe that it would be appropriate for Congress to provide more 
specific guidance as to what it means by the "appropriate achievement goals" phrase in 
the law, and we will consider this issue in drafting our reauthorization proposal. In 
that regard, at least one possible optional definition has been omitted from this 
recommendation: that achievement goals might be the setting of targets for the 
proportion of fourth, eighth, or twelfth grade youth who should demonstrate mastery 
of various areas of knowledge or skills. 

Recommendation: (p. 4-23) We recommend a number of steps that NAGB should take 
to ensure that technical aspects of proposed policies receive early and expert attention 
and that the technical quality of all publications is maintained. These steps can be 
taken within the existing structure, and do not require any change in legislation. We 
also recommend that Congress review the division of functions between NAGB and 
NCES, with a view to aligning those functions more closely with organizational 
strengths and capabilities. 

We concur that statutory clarification of the important roles that NCES and NAGB 
have to play in the NAEP project could serve a constructive purpose, and we will also 
consider this as we develop our reauthorization proposal. NAGB is well suited to 
provide broad policy advice by representing the many constituents served by the 
NAEP project. NCES is well suited to provide the operational and technical expertise 
needed to conduct a complicated survey like NAEP. Both functions are needed in 
order to ensure that the assessment data are technically valid and reliable and, at the 
same time, policy relevant and worth the expenditure of considerable public funds. 

OTHER COMMENTS ON THE REPORT 

As stated on p. 2-3, the NAEP test has historically been designed to "describe the 
range of performance... and to measure the average accurately." However, NAGB 
sought to use the NAEP scale for a different purpose: to measure student 
performance in terms of standards of what students at three levels of achievement... 
should know and be able to do. This different purpose was not ideally served by the 
current NAEP framework and item pool. In the future, NCES and NAGB can 
explore design and scaling modifications that might be needed to support achievement 
levels more fully. 

We disagree with your observation on p. 2-6 that measuring content would require 
"measuring achievement in terms of a separate scale for each level." There is nothing 
in the NAGB procedure that would require development of a scale for each level. 
Although there are enough items to develop six subscales (algebra, geometry, 
estimation, etc.), there are not enough items to develop nine achievement level scales. 
NAGB did not develop achievement levels for each of the six subscales because this 
would have resulted in 18 standards (basic, proficient and advanced for each of the six 
subscales) that the current item pool supply could not support. 

As stated on p. 2-7, it is true that there are many different patterns of answers that 
achieve a given percent correct. In fact, NCES and the Educational Testing Service 
(ETS) originally recommended that the pattern of rater-responses (not the percent 
correct) be used in setting the standards. This would have meant that the standards 
would have been set in the same way the test is actually scored for students. This 
idea was rejected by the advisors to NAGB on the grounds that the percent correct 
approach was (a) more consistent with the traditional Angoff procedure, (b) easier to 
explain, and (c) yielded the same results (on the average) as the more complicated 
pattern of responses. NCES believes that use of the pattern of responses would have 
been a better approach, but it probably would not have substantially changed the 
standards that were selected. 

The report's observations on pp. 2-7 and 2-8 that the "NAEP tests were not designed 
with occupational skills or advanced college course prerequisites in mind. . .predictive 
standards cannot be based on judgment alone; they must be backed by factual 
information" are true. Such issues in external validity will be considered by the 
National Academy of Education as part of its four external validity studies. Of 
course, NAEP is not meant to be predictive. 

We agree with the report's observation on p. 2-12 that one reason the judges may 
have set such high standards is that they did not have the disciplining experience of 
comparing their personal estimates of what students at a given level will do with what 
students like those at that level actually did. Our understanding is that at the time the 
standards were first set, the judges did not know how well the students were doing on 
each item within the range of each standard. Also, the judges did not know how 
many students met or exceeded each standard. Had they used this type of "reality" 
feedback they may have set a different standard. A related issue (on p. 2-13) is that 
the judges were not given information about the probability of guessing the correct 
answer for each item. Again, such information may have affected the judges' ratings. 

We are puzzled by the finding on p. 2-18 that basic and proficient level students did 
better than they should have according to NAGB's judgments on the items relevant to 
that level. By contrast you found (p. 2-19) that students at the advanced level did less 
well than they should have according to NAGB's judges. All of the evidence NCES 
has seen to date indicates that what students "can do" at the various levels is less than 
what NAGB's judges say they "should do" at each level. This led NCES to ask the 
NAEP Technical Review Panel to conduct several studies on this topic and to expand 
discussion by holding meetings on these issues. 

We are not sure that your conclusion on p. 2-24 is correct when you state that the 
performance on advanced achievement levels "is extreme even by world standards." 
We cannot tell from your report what equating methodology you used. Initial results 
from analyses conducted by the NAE and ETS indicate that the high achieving 
countries on the International Assessment of Educational Progress (IAEP) did quite 
well at the advanced level. If this is true, then this implies that the advanced level is 
not extreme relative to the highest performing countries in the IAEP study. 

We find the statement on p. 2-29 that NAGB's interpretations "confound what 
panelists thought students should do with what they actually could do" consistent with 
the NCES experience during adjudication of the 1992 mathematics reports. Various 
reviewers wanted to substitute "can do" for "should do" in the report. This led NCES 
to seek evidence of the extent to which the achievement level descriptions might 
interchangeably be used to describe what students can do. 

The external validity evidence on 1992 achievement levels mentioned on p. 2-33 is 
under study by the National Academy of Education. Four studies on this topic (along 
with six studies on internal validity) are being conducted. The results will not be 
available until Summer 1993, however. 

There is a misstatement on p. 2-35. Although NAGB's policy requires that NAEP 
reports meet NCES technical quality standards, the law does not specifically contain 
such a requirement. Such a requirement would clarify the implied NCES role and 
would result in an explicit sanction for NCES technical input at the design, 
implementation, analysis, and reporting stages of NAGB's activities. 

We would find it helpful if your discussion beginning on p. 3-1 about potential forms 
for student performance standards were expanded to include more options. For 
example, standards could be written into the scoring rubrics for performance 
assessments (NAEP has done something like this with past writing assessments). 
Alternatively, standards can be grounded in external criteria (such as performance on 
the Advanced Placement Test, ACT, or SAT) which serve as a benchmark for NAEP. 
A wide array of approaches to standard setting should be reviewed before decisions 
are made about future procedures to set standards for the NAEP test. We agree with 
your observations on p. 3-13 that the NAE will "provide reliable information about 
the feasibility and usefulness of alternative standard-setting methods." 

As noted on p. 3-16, setting standards on an existing pool of items may lead to biased 
standards if the pool is not sufficient. In the case of the 1990 and 1992 NAEP 
mathematics assessments, the pool of items was representative of the content domain 
to the extent that NAGB's item and test specifications are explicit. If the item and test 
specifications were more explicit, then the item pool could more fully represent the 
content of the frameworks. 

We agree with p. 3-18 that NAEP cannot serve the dual purpose of providing content 
standards as well as overall performance standards. In fact, we feel that one of the 
fundamental difficulties with the approach taken by NAGB was that the achievement 
levels were developed with the goal of overall performance standards, whereas the 
written descriptions had the goal of describing content standards. 

It is true, as stated on p. 3-20, that the NAEP samples could be drawn in such a way 
that different types of standards could be applied to different nationally representative 
samples. Some samples could serve the purpose of monitoring trends, others could 
provide multiple ways of assessing standards, and other samples could be used for a 
host of experimental and innovative approaches to assessing students. Such design 
changes will be considered in the future as NAGB and NCES work to find better ways 
to implement standards-based reporting. 

We appreciate the opportunity to review the draft report and we hope that our 
comments will be helpful to you in producing the final document. Please do not 
hesitate to contact us if you have any questions on our comments or for more 
information. 



Sincerely, 




Emerson J. Elliott 
Acting Assistant Secretary 



The following are GAO's comments on the U.S. Department of Education's 
March 25, 1993, letter. 



1. The department observes that performance standards involve questions 
of what students should know as well as of how well they should do on the 
test and that our report deals only with the latter. We agree that the two 
issues are bound together. Ideally, deciding what students should know 
(and at what level of difficulty) should be the first step in a 
standard-setting process and should guide the design of the test. (See table 
3.1.) Proceeding in this sequence helps ensure that the test and the manner 
of scoring are appropriate to the standards and that test scores can be 
interpreted in terms of the standards. In the case we studied, test content 
and scoring practices were in place before the standards were framed. The 
scores and the standards represented somewhat different concepts of 
performance, which led to problems in interpretation. We have added to 
chapter 3 to clarify that test content is important to the establishment and 
interpretation of overall performance standards. 

2. The department comments that we appear to support only a monitoring 
role for NAEP. We have amended the text of chapter 3 to make clear that 
NAEP tests can be used to set standards of overall performance on material 
that all students are expected to know. 

3. Our observation about subscales was not intended to refer to subscales 
by content area. Since NAGB's approach required the use of the NAEP scale, 
the point we were making was somewhat moot and has been dropped. 

4. We are interested in the department's observation that using a pattern 
approach rather than the percent correct approach to find NAEP scores 
matched to the item judgments would not have made much difference in 
the scores selected. The pattern approach should correct for one possible 
source of error (that is, for differences in item weighting between the item 
judgment procedure and NAEP scaling). It does not correct for 
overestimates of basic-level student performance on very difficult items. 
Our analysis assumes both corrections. 
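
The item weighting point in comment 4 can be illustrated with a minimal sketch. It assumes the usual Angoff convention that each judge rates, for each item, the probability that a borderline student at a level answers correctly. The function names, the ratings, and the weights are hypothetical and appear nowhere in the report; the weighted variant only shows how unequal item weights move the result and is neither the pattern approach nor NAEP scaling.

```python
from statistics import mean


def item_means(ratings):
    """Average the judges' ratings for each item.

    ratings[j][i] is judge j's estimate of the probability that a
    borderline student at the level answers item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    return [mean(judge[i] for judge in ratings) for i in range(n_items)]


def percent_correct_cut(ratings):
    """Percent correct approach: every item counts equally."""
    return 100.0 * mean(item_means(ratings))


def weighted_cut(ratings, weights):
    """Hypothetical variant: items count in proportion to their weights."""
    means = item_means(ratings)
    if len(weights) != len(means):
        raise ValueError("one weight per item is required")
    return 100.0 * sum(w * m for w, m in zip(weights, means)) / sum(weights)


# Illustrative ratings only (3 judges, 3 items); not data from the report.
ratings = [
    [0.8, 0.6, 0.4],
    [0.7, 0.5, 0.3],
    [0.9, 0.7, 0.5],
]
equal = percent_correct_cut(ratings)
unequal = weighted_cut(ratings, [1.0, 1.0, 2.0])  # third item counts double
print(round(equal, 1), round(unequal, 1))
```

In this sketch, giving more weight to the item the judges rated hardest lowers the expected percent correct, which is the kind of discrepancy that can arise when the judgment procedure and the scoring procedure weight items differently.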

5. The department remarks that our analysis of student performance on 
items of varied difficulty produced puzzling results. As we note in chapter 
1, our analysis represents a new way of looking at student performance on 
NAEP. We chose this method because it seemed related to the concept of 
performance expressed in NAGB's definitions. We are pleased that studies 
of student performance at various NAEP scores are continuing, and we look 
forward to learning of the results. 

6. The department notes that the statute that authorizes NCES and NAEP 
does not specifically require that NAEP reports meet NCES technical quality 
standards. Upon rereading the statute, we agree with this observation. 
(The statute provides for the establishment of such standards and requires 
that NAEP reporting be fair, accurate, reliable, and valid but does not 
expressly state that NAEP reports must meet NCES standards.) We have 
deleted reference to the statute from our text. 



THE NATION'S 
REPORT 
CARD 



National Assessment Governing Board 

National Assessment of Educational Progress 



March 23, 1993 



Eleanor Chelimsky 

Assistant Comptroller General 

Program Evaluation and Methodology Division 

General Accounting Office 

Washington, DC 20548 

Dear Mrs. Chelimsky: 

Enclosed herewith is the response from the National Assessment Governing 
Board to the draft GAO report concerning the setting of achievement levels by the 
Board, and the Board's use of technical expertise. 

This response was approved for transmittal to the GAO by the Board's Executive 
Committee on March 19, 1993. Attached also, is a short paper from Ronald Hambleton, 
the Board's chief consultant for its 1990 effort; a paper from American College 
Testing, the Board's lead contractor for the 1992 levels-setting; and a paper from 
Gregory Cizek, an expert in standard setting, who was uninvolved in either the 1990 
or 1992 achievement levels activities. 

Please do not hesitate to contact me if you have questions regarding our 
response. 

Sincerely, 

Mark D. Musick 
Chairman 



RECEIVED 

MAR 23 1993 

GAO PEMD 



Enclosure 



800 North Capitol Street N.W. 
Suite 825 
Mailstop 7583 
Washington, D.C. 20002-4233 
(202) 357-4938 

See comment 2. 

See comment 3. 
See comment 4. 



See comment 5 



Introduction and Highlights 



The General Accounting Office's draft report on achievement levels for the National 
Assessment of Educational Progress is based on the same misunderstandings that appeared more 
than a year ago in the agency's interim report. It reflects the same fundamental disagreements 
about the value and nature of standards for educational performance. 

In summary, the National Assessment Governing Board makes these main points: 

• National Assessment results should be reported primarily in terms of challenging 
standards that help the nation determine "how good is good enough." The 
conventional practice of simply comparing one group of students to another is no 
longer adequate. GAO makes no compelling argument for returning solely to the 
older methods of reporting by means, percentiles, and "benchmarks." 

• The Board and numerous other groups believe that achievement levels can properly 
be used to report results on the National Assessment. We reject the argument that 
trying to set standards on NAEP is "conceptually flawed." We reject GAO's 
recommendation that the 1992 achievement levels be withdrawn. 



• The GAO report is unbalanced and misleading. Many of its assertions are 
undocumented; much of its analysis is flawed. 



• The GAO report is out-of-date. It focuses on the achievement levels for 1990, 
indeed, mostly on the first phase of the process for setting them which did not form 
the basis for the levels actually adopted. It gives relatively little attention to the 
standard-setting process for 1992 and fails to recognize the improvements made. 

The process for setting the 1992 achievement levels was conducted under a $1.5 million 
contract by American College Testing (ACT), which has extensive experience in standard-setting 
in many fields. ACT consulted regularly with a panel of leading experts in measurement and 
standard-setting who believe strongly in the feasibility of setting standards on NAEP and in the 
soundness of the process used to advise the Board on what the levels should be. 

The movement from norms, based on test averages, to standards, based on informed 
judgment of what students ought to know and do, is occurring not only on NAEP but in many 
parts of American education. It stems from dissatisfaction with "national norms," which by 
definition place half of all students below an average score that may be woefully inadequate. 
The movement to standards also reflects the conviction that setting clear markers of what 
students should learn makes any test far more useful and meaningful to parents, schools, and the 
public. 

Yet the authors of the GAO report seem cool to this central idea. They frame the issue 
as "statistical quality," not policy judgment. They suggest alternatives that would not really 
yield standards at all, just norm-referenced descriptions of performance. For example, the 
Board rejects the kind of "benchmark" example suggested by GAO in which acceptable 
performance is arbitrarily set at the 30th percentile of student achievement. 

The report seems premised on two major misinterpretations. First, it fails to recognize 
the extent to which setting test standards involves policy judgment rather than a technical process 
to find an "accurate" score. Second, in contrast to what the report asserts, standards often are 
set on tests quite similar to NAEP using the same system of collecting judgments-the Angoff 
procedure. Far from being "novel," the procedure is widespread. 

In arriving at the standards, most of the experts on whose judgments NAGB relied were 
classroom teachers, bringing first-hand experience from many parts of the nation. The standards 
adopted contain reasonable descriptions of what students should learn. They are meant to denote 
overall levels of proficiency, well-suited for placement on the NAEP scale, not checklists of 
specific skills. 

The GAO report relies on outmoded models of psychometric evaluation. In particular, 
it conceives of validity as an all-or-nothing proposition when it properly is a matter of degree, 
based on the weight of the evidence and the uses made of results.

NAGB believes that using standards on NAEP is a developing process. It has adopted 
preliminary descriptions of the levels as part of the frameworks for 1994 NAEP exams, and is 
certain there will be other changes over the years to make achievement standards a primary 
factor in creating NAEP assessments as well as in reporting them. It believes strongly, though, 
that any improvements that may occur in the future do not detract from the overall soundness 
and utility of the 1992 NAEP achievement levels being developed by ACT. 

The Governing Board agrees with GAO about the importance of securing technical 
advice, and has done so regularly in regard to achievement levels, as well as in its other work. 
However, because of the wide impact of NAEP, the assessment should be guided by an
independent, widely-representative policy-making board-not a closed circle of federal officials 
and technicians. 

Appended are comments by ACT; Ronald Hambleton, of the University of Massachusetts;
and Gregory Cizek, of the University of Toledo.



See comment 11.



From Norms to Standards: A Basic Policy Issue 

There is an issue of basic policy underlying both GAO's report and NAGB's comments: 

Should the National Assessment present results in terms of performance standards rather 
than simply using scale scores and proficiency levels, based on the distribution of test results? 
Should NAEP include judgmental standards of what achievement ought to be? 

The Board believes standards are essential, and it is responsible for setting them under
NAEP's 1988 authorizing legislation. The law calls for "appropriate achievement goals for each
grade... and subject area to be tested under the National Assessment," and these are clearly
meant to differ from NAEP's previous descriptions of actual performance. 

By giving an informed, deliberate judgment about "how good is good enough," the 
achievement levels make the National Assessment far more useful and meaningful to policy- 
makers and the public. They also help NAEP play its crucial role in tracking progress toward 
the national education goals. It is no longer enough, the Board believes, for NAEP simply to 
report who is above average and who is below without giving a sense of what should be 
expected. 

A similar interest in reporting by standards rather than by norms is at the heart of 
reforms in many state testing programs and of the New Standards Project. Yet, the authors of 
the GAO report seem cool to the idea. They frame the issue as "statistical quality" not policy 
judgment and repeatedly counsel "reasonableness" and "realism," apparently code words for not 
expecting too much. 

Most of the report's suggested alternatives for standard-setting would simply describe the 
performance of various groups, not deal with the substance of what students should know and 
be able to do to move beyond the status quo. Although the report ostensibly takes a technical
stance, its authors' policy views clearly color their descriptions, interpretations, and conclusions. 



Two Major Misinterpretations

The report seems premised on two major misinterpretations of NAGB's standard-setting 
effort. First, it fails to recognize the extent to which setting standards on any test involves 
policy decisions and judgments rather than a tightly-defined technical process to find one 
"accurate" score. Second, in contrast to what the report asserts, standards often are set on tests 
quite similar to NAEP through the same system of collecting judgments-the Angoff procedure-
that NAGB used. 

The standards on NAEP, as on similar exams, denote a general, overall level of 
proficiency; they are not intended to be a specific check-list of skills. The standards are 
valuable for interpreting the significance of NAEP results-far more valuable than any skill 
check-list could ever be.



See comment 12.



See comment 13. 



See comment 14.



Repeatedly, the report complains that the achievement levels are not "accurate" and do
not "measure" what they are supposed to. In fact, the measuring in NAEP is done by test items
and the scale used in reporting results. The achievement levels, like any standards, are a series
of judgments about how to interpret these results. It is reasonable to debate whether they are
too high or too low, but to say they are "not accurate," as if some "true" levels exist, is absurd.

As Richard Jaeger writes, "All standard-setting is judgmental. No amount of data
collection, data analysis, and model building can replace the ultimate judgmental act of deciding
which levels of performance are meritorious or acceptable and which are unacceptable or
inadequate." (Jaeger, R.M., "Measurement Consequences of Selected Standard-Setting Models,"
1979.)

The central issues that should be considered are whether the judgments themselves are 
informed and defensible and whether they are arrived at through a careful process. For NAEP 
the judgments ultimately are made by the Governing Board, but they are informed by a wide, 
public consultative process, as well as by the careful judgments of broad-based panels of experts,
most of whom are classroom teachers. 

The Board believes that the achievement levels in mathematics contain reasonable 
descriptions of what students should learn in the different grades. GAO does not challenge this 
view, and, as the report even notes, the levels for 1990 received strong support from educators 
and policy-makers who considered them, as shown in surveys conducted for NAGB. 

Far from being "novel," as the report claims, the Angoff procedure to apply expert 
judgments to the NAEP exam is the most widely-used standard-setting process in the nation. 
It has been implemented on hundreds of tests over two decades. The first to suggest using it on 
NAEP was Albert Beaton, one of the leading theoreticians in the development of NAEP for 
Educational Testing Service. The specific design for 1990 came from national experts in
measurement. As the process developed, it also took into account recommendations by a panel 
of state testing directors. 

The design for 1992 was prepared by ACT, which has used variants of this system on 
dozens of examinations. Again, there was wide input and review by national experts, including 
staff of the National Center for Education Statistics. To suggest, as GAO does, that NAGB, a
largely non-technical, policy group, took it upon itself to develop a standard-setting process is
simply not true. Yes, the Angoff procedure was "modified;" it nearly always is adapted to
specific situations. 



Validity 



In discussing validity, the GAO report is confused, using an outmoded "all-or-nothing" 
approach. For example, the report says "valid measurement depends, ultimately, on having a 
measuring instrument and a method suited to the purpose." Yet, the most influential recent work
on the subject, by Samuel Messick, of ETS, explains that validity concerns "not the test or
observation device as such but the inferences derived from test scores or other indicators."



See comment 15.



See comment 16. 



As Messick defines it, validity is "an integrated evaluative judgment of the degree to
which empirical evidence and theoretical rationales support the adequacy and appropriateness
of inferences and actions based on test scores or other modes of assessment." He continues,
"Validity is a matter of degree, not all or none....[It] is an evolving property and validation is
a continuing process." (Messick, S., "Validity," in Educational Measurement (3rd Edition), 1988)
Thus, the GAO report misstates an important point when it says validity is a property of the test 
and not a characteristic that applies to the interpretations of test scores. 

When it discusses whether NAGB followed "a valid process" in setting achievement 
levels, virtually all material in the report deals with the process used for 1990, and even that is 
limited. When it deals with valid inferences from the results, all of its evidence relates to 1990-
not the levels for 1992. A number of validity studies of 1992 achievement levels are underway. 
Such studies should continue and more should be done in future years. The Board expects the 
information they provide will be useful in developing standards for future NAEP assessments. 



Significant Changes for 1992

As explained in the attached response by ACT, there were significant changes in how the
Angoff procedure was carried out on the 1992 assessments, compared to 1990: 



• more opportunity for public input 

• increased technical support 

• an elaborate procedure for selecting judgment panels which assured broad
representation nationwide

• development of operational definitions before individual test items were rated 



• increased training and feedback for panelists 

• increased time for panel deliberations 

• a pilot study of the entire process 

• a built-in reliability check by splitting the item pool between groups of judges

These seem to have met many of the specific concerns that GAO has raised. Yet, the 
report only briefly examines the changes. Instead, it simply asserts that no matter what 
improvements may be made in procedure, NAGB's approach is "inherently unsound" because 
the nature of the National Assessment precludes it.

The reason for this "fundamental flaw" is not clearly explained in the report. It seems
to be that performance standards which refer to content mastery can only be properly set on a 
specially-designed criterion-referenced test. An example might be the "mastery exams" used in 
many grade-school classrooms to determine if a particular type of addition problem has been 
learned, although GAO does not spell that out. 

It is true, of course, that there can be no skill-by-skill confirmation of what students 
know unless they take a test designed to give skill-by-skill results. NAEP can't do that-nor can 
most widely-administered exams. But that's not what NAGB's achievement standards are trying 
to do. 



See comment 17. 



See comment 18. 



The central language of the standards, used in the definition of proficient, is taken 
directly from the National Education Goals: that students should have "demonstrated competency 
in challenging subject matter." This is the "solid academic performance" which the proficient 
level is intended to represent. Clearly, the achievement standards, like the Goals, refer to a 
general degree of attainment. That is what NAEP is meant to show, as GAO explains. 

The text of the subject-matter descriptions which are part of the achievement levels is 
written in general terms to describe the sorts of skills and knowledge that students at each level 
should have. It is derived from the test framework and is useful, as GAO recommends, in
giving judges a common basis for rating test items and in explaining to the public the level of 
competence expected. 

Contrary to what the report asserts, the achievement levels are meant to be a general 
performance standard on the NAEP scale. NAGB is not trying simultaneously to make detailed 
specifications of content mastery. For GAO to say so is based on a strained misreading of 
Board policy, not on the achievement levels as they have actually been developed. In essence, 
the report creates a straw man that it can easily knock down; thus, much of its lengthy 
discussion is irrelevant. 

What is relevant is that most standard-setting in American schools and professions is done 
on tests, like the National Assessment, that contain a wide range of items relevant to the 
standards being set. In some cases the standards themselves may be part of the test design, and 
NAEP may evolve in that direction. But this type of test development is not a precondition for 
meaningful standards. As a matter of fact, what the Board has done in setting overall 
performance standards is analogous to the five levels of performance on Advanced Placement 
tests and to what millions of teachers do every day in grading examinations. 

Again, although one may criticize how NAGB went about setting standards on NAEP, 
there is no "fundamental flaw" in setting them. Indeed, neither ETS nor ACT-two of the 
nation's most prestigious testing organizations-has ever brought to the Board's attention the 
point that GAO makes even though both were involved in developing the methods for setting the
standards and for placing them on the NAEP scale.



Reliability




In its very brief treatment of reliability, the report makes no distinctions among the
various types of error that may be part of a large-scale assessment. There is error associated 
with the assessment itself, error associated with the sample of students being assessed, error 
associated with a standards panel's judges, and finally, error associated with the estimates of the 
proportion of students that meet each standard. 




Setting achievement levels has no impact on the first two sorts of error estimates. These 
exist with or without standards and to the same degree. 


See comment 19. 


Regarding the judges' estimates for 1990, no definitive estimates of error can be made
because of the lack of two independent samples of judges' ratings. Therefore, whatever 
conclusions the report has reached can only be surmises, for which no one has data to prove or 
disprove claims about reliability. 

Although reliability data for 1992 are not yet complete in all subjects, analysis so far
indicates reliability in math is high, according to ACT's technical report. For 1992 the item
pool was divided in half, permitting comparisons between panels of judges.

From Judgments to the Scale 


See comment 20. 


One particular point in the GAO report is baffling. The report asserts that an 
"inappropriate method" was used to transform the judgment panels' recommendations into scores 
on the NAEP scales. Yet, for 1990 the method of going from an expected percent correct to 
a scale score was developed by ETS which created the scales; for 1992 it was done by ETS and 
ACT, both of which concurred in what to do. GAO presents no evidence that the procedure 
yields incorrect results. 

This part of the process is technical work: it was carried out in close consultation with 
NCES. For GAO to dispute it with virtually no explanation or analysis is unwarranted. 
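
The first half of this step, turning panelists' item ratings into an "expected percent correct," can be illustrated with a minimal sketch. It assumes the basic, unmodified Angoff calculation, in which each judge estimates the probability that a student at the borderline of a level answers each item correctly and the estimates are averaged. The ratings and the function name are invented for illustration; they are not the 1990 or 1992 panel data, and the modified procedure actually used may have differed in detail.

```python
from statistics import mean


def angoff_expected_percent_correct(ratings):
    """Average Angoff ratings into an expected percent correct.

    ratings[j][i] is judge j's estimated probability (0 to 1) that a
    student at the borderline of the level answers item i correctly.
    (Hypothetical helper for illustration only.)
    """
    # Average across items for each judge, then across judges.
    per_judge = [mean(judge_ratings) for judge_ratings in ratings]
    return 100.0 * mean(per_judge)


# Invented ratings: three judges, four items.
ratings = [
    [0.60, 0.40, 0.80, 0.50],
    [0.70, 0.30, 0.90, 0.50],
    [0.50, 0.50, 0.70, 0.40],
]

expected_pct = angoff_expected_percent_correct(ratings)
print(f"Expected percent correct: {expected_pct:.1f}")
```

The result is only a percent-correct figure on the rated items; placing it on the NAEP scale is a separate step.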


See comment 21. 
Now on page 31. 

See comment 22. 


Also, the table GAO has prepared (on page 2-18), purporting to show a discrepancy
between item-by-item judgments of the panelists and the actual 1990 results, is flawed. It relies
on extrapolations rather than actual data, and is based on a mistaken premise, discussed above,
that achievement levels are meant to be skill-by-skill specifications of mastery rather than general
descriptions of degrees of mathematics proficiency. Since the goal was to set standards on the 
NAEP scale, it was appropriate for judges to rate all items contributing to the scale, not just 
those "at the level." 


See comment 23.


Even so, there is considerable agreement between the judgments and results, which tends 
to confirm the process NAGB used, not discredit it. The small differences that do appear seem 
to reflect a regression to the mean-with students at the top scoring somewhat lower than 
expected and those near the bottom doing better.



See comment 24.



Now on page 45. 



See comment 25. 



Of course, there should be a good match between the judges' item judgments and the 
actual pattern of results because the judges knew the overall percent correct for each item as they 
made their judgments. In effect, easier content tied to easier questions is the basis for the lower 
achievement levels; harder content tied to harder questions is the basis for the higher 
achievement levels. Both the descriptions and the degree of difficulty of the questions follow 
the logical development of the subject being tested. 

Throughout the report, GAO seems to assume that descriptions of overall performance
standards should make no reference to the particular kinds of performance that might be 
expected of those who meet them. This is mistaken. As Michael Kane, of the University of 
Wisconsin, has pointed out, in licensing and competency exams such references are usually
made, just as they are in NAEP. The intent is not to guarantee skill-by-skill mastery but to 
provide what every standard needs-a clear indication of the level of competence expected. 

In part of its report, GAO seems to understand this point well. It even suggests using
the Angoff method (on page 3-12) to set "overall performance standards" on NAEP, provided
there is "clear... specification of what students at each level should be able to do." That
essentially is what ACT did in recommending the levels to NAGB for 1992. Yet, the report
continues to reject them. 



Scaling on NAEP 


The report raises two questions related to NAEP scaling. First, how can the standard- 
setting process include all items yet "arrive at a standard that validly reflect(s) mastery of only 
some items?" And second, doesn't it matter whether or not students get a certain percent 
correct by the same pattern of answers expected by the standard-setting panels for a particular 
level? 

The implication of both is that NAEP scaling procedures are faulty. But in fact, NAEP
scaling works in such a way that it is always possible to relate a percent correct score to a
NAEP scale point through a mathematical relationship called the "test characteristic curve." If 
GAO had evidence that NAEP scaling is flawed, these might be reasonable concerns. However, 
there have been numerous studies demonstrating its soundness and no published evidence to the 
contrary. 
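
The relationship between a percent correct score and a scale point can be sketched as follows. This is an illustration under stated assumptions, not the ETS procedure: it assumes a three-parameter logistic item response model, invented item parameters, and a hypothetical linear transformation from the underlying proficiency value (theta) to the reporting scale. Because the test characteristic curve rises steadily with proficiency, any attainable percent correct corresponds to exactly one scale point, which the sketch finds by bisection.

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct answer under an assumed 3PL item model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected proportion correct at theta."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items) / len(items)


def theta_for_proportion(target, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Invert the curve by bisection; valid because it increases with theta."""
    if not tcc(lo, items) <= target <= tcc(hi, items):
        raise ValueError("target proportion is outside the attainable range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Invented (discrimination, difficulty, guessing) parameters for four items.
ITEMS = [(1.0, -1.0, 0.20), (0.8, 0.0, 0.20), (1.2, 0.5, 0.25), (1.0, 1.5, 0.20)]

# Hypothetical linear transformation from theta to a reporting scale.
SCALE_MEAN, SCALE_SD = 250.0, 50.0

theta = theta_for_proportion(0.567, ITEMS)
scale_score = SCALE_MEAN + SCALE_SD * theta
print(f"theta = {theta:.3f}, scale score = {scale_score:.1f}")
```

The sketch shows only that the mapping is well defined once a scaling model is accepted; whether the model fits the data is the separate question addressed by the scaling studies mentioned above.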

The report also raises a question about the kind of items-easy or hard-that students at 
the basic level are expected to answer. This, too, is a scaling issue, not an evaluation of the
achievement levels. First, NAEP does not provide scores for individual students, and second,
no student takes all test items. Of course, there are virtually infinite patterns of answers that 
could lead to a particular score. But on average-and that is what NAEP scaling allows-a score 
of 255, for example, is most likely to reflect the progression of easier to harder questions used 
in setting the eighth grade basic level at that point.



See comment 26.



See comment 27. 
See comment 28. 



See comment 29.



Many of these issues apply to the descriptive anchor points as well as to the judgmental 
achievement levels. Yet, the report praises anchor points as "straightforward" and
"serviceable"--without analyzing how they are developed--while criticizing achievement levels,
placed on the same scale, as "misleading." 



Are the Standards Too High?

Despite the report's lengthy critique of NAGB's standard-setting method, one of its 
principal objections seems to be that the standards themselves are "unreasonably high." The
type of standards that should be set, the report says, must be "reasonable and appropriate for
the general student population of the United States." 

It gives one example of what it has in mind: the test of General Educational
Development (GED) required to qualify for a high school equivalency certificate. For that exam 
the passing score is set at the 30th percentile of a national sample of high school seniors. 

The standards on NAEP are not that kind of norm-referenced standard. Deliberately, 
they are not based on current performance, but, like the national education goals, are meant to 
be "challenging," to serve as goals of what achievement ought to be. 

On the NAEP math exams of 1990 and 1992 only about 15 to 25 percent of students 
in the national samples reached the "proficient" standard; just 1 to 4 percent were "advanced;"
and slightly less than two-thirds met the partially proficient standard called "basic." As GAO
notes, in recent years about 60 percent of American high school graduates have gone on to
college. Yet, more than half of them drop out without ever graduating.

Almost by definition, if a standard is challenging, relatively few students can reach it 
now. For example, in Kentucky's new statewide exams, only about 10 percent of students 
reached the proficient level in any subject. In Maryland about 20 to 25 percent made the 
"acceptable" level of 3 or better. In the first pilot tests of the New Standards Project-even
though the reliability of scoring was less than desired-only about 25 percent of students reached 
the criteria for passing. 

Each of these tests was scored independently of the National Assessment. Yet, each was 
committed to high standards. And each had quite similar results. In their top categories-similar
to NAEP's "advanced"-the proportion of students ranged from 5 percent down to less than 0.5 
percent. 

See comment 30.



See comment 31.



By GAO's calculations, only 10 percent of 13-year-olds in Taiwan and 5 percent in China
and Korea can reach the 1990 NAEP advanced level in 8th grade math. The American figure
was 1 percent, indicating that ten times as many Taiwanese youngsters seem to have met the
standard as Americans. Yet, if the U.S. is to be "first in the world" in math and science by the
year 2000, as the National Education Goals proclaim, surely as many American students should
reach this level as those in Asia. To call the standard "extreme," as GAO asserts, is
unwarranted. And, of course, that is a policy judgment which might properly be made by a
policy board--not a "technical issue."

Also, it is uncertain whether GAO's calculations are correct. A full-scale study linking 
NAEP to the International Assessment of Educational Progress is scheduled for release later this
year. 

Of course, one way to get the external validity GAO seems to want would be to set 
standards based on actual performance-perhaps tying advanced to the top 10 percent of students 
and proficient to the top half or third. But an approach such as that would not really yield 
standards at all, just norm-referenced descriptions of performance. The proportion to exceed 
a standard would be known in advance because the standard itself would guarantee the 
proportion. 
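
The point can be shown with a short sketch using simulated scores. Everything here is invented for illustration: the score distribution, the 25-point improvement, and the fixed cut score of 280 are assumptions, not NAEP figures. A cut score tied to a percentile labels the same share of students "advanced" however much achievement rises, whereas a fixed cut score lets the share change as achievement changes.

```python
import random


def percentile_cut(scores, pct):
    """Score at the given percentile rank, using a simple nearest-rank rule."""
    ordered = sorted(scores)
    k = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[k]


def share_at_or_above(scores, cut):
    """Fraction of scores at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)


random.seed(0)
# Simulated scores for one cohort, and the same cohort scoring 25 points higher.
lower_cohort = [random.gauss(240, 30) for _ in range(10_000)]
higher_cohort = [s + 25 for s in lower_cohort]

FIXED_CUT = 280.0  # hypothetical judgment-based cut score

for label, scores in (("lower", lower_cohort), ("higher", higher_cohort)):
    norm_cut = percentile_cut(scores, 90)  # "advanced" = top 10 percent
    print(
        label,
        "top-10-percent rule:", round(share_at_or_above(scores, norm_cut), 3),
        "fixed cut:", round(share_at_or_above(scores, FIXED_CUT), 3),
    )
```

Under the percentile rule the share meeting the "standard" stays at about one-tenth for both cohorts; only the fixed cut score registers the improvement.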

In that case there would be no reason for anyone to deliberate about what students should 
know; the only decision to make would be what proportion should be given what label. There 
would still be the task of describing what students do know at the pre-selected points on the 
curve of results--a job surely for experts, but with no credible claim to be setting standards. 



The Role of the Governing Board 



The National Assessment Governing Board receives a great deal of technical advice and 
certainly is not isolated from testing experts, as GAO suggests. Indeed, to set achievement 
levels for 1992 the Board contracted with ACT, which, in turn, was advised by a panel of 
experts who helped shape procedures and strongly endorsed them. Also, two members of the 
Governing Board itself are testing experts, appointed to four-year terms last October. 

But the Board, which by law must be bipartisan, also includes governors, state 
legislators, school superintendents, teachers, and members of the public. And it is the Board 
as a whole that must sift through all the advice it receives to make policy judgments. That is
the proper role of a broadly-representative, policy-making Board.

Creating such a board for NAEP was one of the principal recommendations six years ago 
of a major study commission, whose members included prominent education policy makers and 
psychometricians. The commission also called for state-by-state NAEP testing and achievement 
goals. It said such an expansion would be widely accepted only if the governance of NAEP 
"reflect(ed) an array of education, measurement, and policy perspectives."

Having an independent board set policy for NAEP is very much in the traditions of
American education and democracy. In virtually all parts of the country lay school boards 
determine policy for state education departments and local schools, not commissioners or 
superintendents. Members of all these boards lack the expertise of full-time professionals. That
is precisely why boards are created-to represent the various publics involved, to give legitimacy
to decisions, and to bring a much broader view to bear than the interests of particular 
professions. 

If NAEP had remained low-impact and low-profile, it might well continue to be run by 
federal officials and contractors. But as the importance of the National Assessment grows, the 
role of its Governing Board becomes more crucial. 



See comment 32. 






Response to Specific GAO Recommendations 



1. The Governing Board should withdraw its direction to the National Center for Education
Statistics that the 1992 NAEP results be published primarily in terms of [achievement] levels. 
The conventional reporting format should be retained until an alternative has been shown to 
be sound. 



Response: The Governing Board does not agree that the 1992 achievement levels should be 
withdrawn. The GAO recommendation is unwarranted. 

The report asserts that there are such "inherent flaws" in the procedure used to set the 
levels that no amount of improvement is sufficient. As our comments indicate, GAO has 
misunderstood the nature of standard-setting on NAEP and applied inappropriate criteria and 
methods in evaluating it. Further, the Angoff method used by NAGB is the most widely 
employed and evaluated standard-setting procedure in the nation. In response number three 
below, we cite the unanimity of expert opinion given to the Board that the Angoff method is
appropriate for use on NAEP. The attachment from American College Testing (ACT), our 
achievement levels contractor, describes changes that have been implemented on the 1992 
assessments. 

The achievement levels are part of a desirable shift, underway in many parts of American 
education, from norm-based to standards-based testing. They will improve the usefulness of 
NAEP to the public and policy-makers. 

The Board has no objection to the second part of the recommendation that NAEP's
"conventional" reporting formats, presumably averages and anchor points, continue to be used. 
NAGB has never maintained that achievement levels be the only way of reporting NAEP results. 

We note, however, that in supporting continuation of the anchor points GAO gives 
approval to an approach which it has not evaluated. The report fails to consider criticisms of 
this approach that have been made in educational measurement literature. Even in using the 
term "conventional," the GAO implies that the validity of the anchor points has been established 
when in fact it has not.



See comment 33.



See comment 34. 



See comment 35. 



2. The Governing Board and NCES should develop a joint plan and schedule for review of the 
achievement levels. This should include a determination by the Commissioner of whether (1) 
the approach is so technically and conceptually flawed that its results should not be proposed 
for publication, or, (2) the approach is sufficiently promising that preparations for NCES pre- 
publication review should be designed and implemented. 



Response: NCES has been and will continue to be closely involved in the process of setting 
achievement levels. Reports containing the 1992 achievement levels have gone through the 
NCES pre-publication review. This review has dealt with reporting issues, including the 
question of what inferences can properly be made about what students can do based on the
achievement levels.

However, setting standards is primarily a matter of policy judgment, not of statistical 
quality. Determinations in this area should be made by NAEP's policy board, not by a statistical 
agency. 



3. The Governing Board should withdraw its approach of applying achievement levels to future
NAEP tests and join with NCES in exploring alternatives for setting both content-based and 
overall performance standards with respect to NAEP. 



Response: The Governing Board, together with NCES, has already explored alternatives for
setting achievement levels and will continue to do so, but it believes the Angoff method has 
worked well. 

From the initial planning to the present, both NCES and Educational Testing Service, the 
NAEP contractor, have been involved in the achievement levels process. In fact, the late 
William Angoff, then a senior research scientist at ETS, was consulted and indicated that this 
method would be suitable. 

At a meeting in December 1991, jointly sponsored by NAGB and NCES, other
approaches for setting achievement levels were explored. The view of the group-which
included psychometricians from ETS and ACT and other testing experts-was unanimous: While 
other approaches exist, the Angoff method is the most widely used and thoroughly evaluated; 
no other approach is better overall for setting achievement levels for NAEP. 

The Board will continue to be open to consideration of other standard-setting approaches.
Unfortunately, those mentioned in the report appear naive and unsupported by research evidence.
However, NAGB believes that over time achievement standards should become a primary factor
in preparing NAEP as well as in reporting it. The Board will not rule out proposals for other
approaches in future competitions to select an achievement levels contractor.



4. (a) Congress should specify what it intends in directing the Governing Board to identify
appropriate achievement goals: whether it envisions the establishment of overall performance 
standards, the establishment of content-based performance standards, or simply better alignment 
of test coverage with content mastery standards. 

(b) Congress may also wish to express specific guidance with respect to aligning NAEP with 
any national content standards as they come into existence, given that it is considering 
legislation to establish a mechanism for adopting national content standards. 



See comment 36. 



Response: As Congress considers part (a) of this recommendation, the Governing Board
submits that the distinction made by GAO between "overall" and "content-based" performance 
standards is drawn more sharply than warranted in practice. The report assumes, incorrectly, 
that overall performance standards make no reference to the kinds of performance that might be 
expected of those who meet the standard. In the standard-setting field, overall performance (i.e., 
a "passing" score on a test) is almost always associated with a clear indication of the level of 
competence expected. The achievement levels include particular scores on the NAEP scale, 
general descriptions of the content represented by that level, and illustrative test items. "Overall" 
and "content-based" standards are not mutually exclusive in their application, as GAO maintains. 
No fixed definitions should be enacted in law that would preclude flexibility and improvement. 

(b) The Board believes that the content of NAEP should reflect both current and evolving 
instructional practice. It should neither be determined by nor ignore voluntary national content 
and performance standards as they are developed. Instead, a balance should be achieved, 
through the national consensus process used in developing assessment frameworks, that 
appropriately aligns the National Assessment with the standards over the course of successive 
administrations of a subject area assessment. 

5. To ensure against technically unsound policies or technically flawed results, GAO recommends 
that the Board: 

(a) Obtain NCES review of policies proposed by the Governing Board prior to final decision; 

(b) Analyze the probable effect of proposed policies on NAEP's ability to present achievement 
fairly and accurately and to support valid, reliable trend reporting; 

(c) Pilot test and thoroughly evaluate any new design or analysis procedure before it is fully 
implemented and results reported; and 

(d) Adopt standards of technical quality (to be applied internally) for publications issued under 
its own authority and secure competent external technical review of such publications prior to 
authorizing their release. 

6. GAO recommends that the Governing Board review actions it has taken with respect to its 
statutory responsibilities in the past two years, identify those whose technical consequences 
have not been sufficiently examined, and secure technical review as necessary to ensure these 
actions will not generate unanticipated technical difficulties in the future. 

7. GAO recommends that NAGB review each proposed policy for conformity to its "Policy on 
Policies" to ensure that the Board prescribes policy ends, not technical details of 
implementation. 



See comment 37. 



See comment 38. 



Response: Although the Governing Board does not object to the general direction of these 
recommendations, it rejects the implication that they are based on substantiated findings of 
failure on its part with respect to technical matters. 

The GAO report includes three studies of the Board's handling of technical issues. In 
two cases it commends NAGB's actions, finding that it recognized the need for technical advice 
and properly considered it. In only one case, the 1990 achievement levels, does GAO conclude 
that the Board failed to recognize that it needed technical advice. In fact, however, extensive 
technical advice was sought and obtained from inception to conclusion of that project. For 1992, 
the Board contracted with ACT, a respected standard-setting organization, to conduct the 
achievement levels process. Also, NCES and ETS have been closely involved throughout the 
achievement levels process. The assertion that NCES and ETS have been systematically ignored 
and uninvolved is untrue. 

The report is mistaken in suggesting that the Board has not properly followed its "policy 
on policies" with respect to achievement levels. Since setting "appropriate achievement goals" 
is a direct responsibility of NAGB under the statute, the Board is responsible in this case for 
both policy ends and means. Its "policy on policies" is intended for activities carried out by 
NCES and the NAEP contractor where it is appropriate that only the ends be prescribed. It is 
illogical to suggest that this policy should also apply to the Board in carrying out its own 
specified duties. 

8. GAO recommends that NAGB nominate for the two testing and measurement positions only 
persons with relevant professional qualifications, who are trained and experienced in the design 
and analysis of large-scale educational tests. To further add technical expertise within its 
current membership structure, NAGB should also ensure that two or more of the elected 
officials, educators and representatives of the general public appointed to the Board have 
significant technical knowledge and experience. 



Response: The Governing Board will continue to solicit recommendations from appropriate 
individuals and groups and will nominate well-qualified persons for Board vacancies as they 
occur. Professors Jason Millman and Michael Nettles, the members serving in the positions for 
testing and measurement experts, are eminently qualified. They were chosen last October, as 
required by law, from a list of qualified individuals nominated by the Board. 

The Board seeks strong nominees in all categories of representation, and regularly solicits 
recommendations from more than 700 organizations and individuals. Many of its members have 
considerable experience with testing issues and all become well-informed. It would be unwise, 
however, to limit the representativeness of other categories of membership, e.g., the general 
public, to provide more representation in a category already having two specified positions on 
the Board. 



9. Congress should clarify the division of responsibilities between NAGB and NCES, with a 
view toward concentrating NAGB's efforts on representational functions for which the Board 
is well designed. While NAGB as it is currently constituted can appropriately advise the 
Commissioner from a constituency perspective regarding functions that are technical (such as 
methods and design of the assessment), the Board does not have the technical resources to 
carry out these functions and should be relieved of this responsibility. When Congress has 
more clearly determined what NAGB's functions should be, it should review NAGB's 
membership and increase the number of technically trained members as needed. 



See comment 39. 



See comment 40. 



See comment 41. 



Response: As Congress considers this recommendation, the Board submits that GAO has 
drawn a conclusion about the capability of NAGB that is not supported by the facts. Although 
it criticizes NAGB in respect to achievement levels, GAO gives short shrift to the two other 
policy areas in which the report commends Board actions. It also ignores the consistent success 
of one of our major policy/technical responsibilities: conducting the national consensus process 
by which the framework for each assessment is developed. Certainly, planning for assessments 
in subjects such as reading and history provides fertile ground for controversy, yet no mention 
is made of the Board's positive record in these areas. 

Also, the report makes too stark a distinction in suggesting that technical and policy 
issues are discrete and easily separable. In truth, both policy and technical matters are involved 
in almost every issue of importance to NAEP. A sharp division between them would be unwise 
if NAEP is to have a governance structure with the strong checks and balances needed to 
maintain its independence and integrity. Because the impact of the National Assessment is wide, 
it is essential that NAEP have a strong, independent policy-making board, representing the wide 
range of interests it affects, not a weak advisory committee, as GAO suggests. 

The following are GAO's comments on the National Assessment Governing 
Board's March 23, 1993, letter. 



GAO Comments 



1. NAGB misstates our position as advocating "returning solely to the older 
methods of reporting" as opposed to reporting in terms of performance 
standards. In fact, we conclude in chapter 3 that overall performance 
standards can usefully be established for NAEP. We recommend that 
conventional descriptive NAEP reporting be retained for now simply 
because no satisfactory standards-based alternative is yet available. 

2. NAGB again describes our position incorrectly. We do not argue that 
trying to set standards on NAEP is conceptually flawed; we find that NAGB's 
particular approach to doing so is conceptually flawed. As we explain in 
chapter 2, the conceptual flaw is that NAGB tries to do two things at once. It 
establishes standards of overall performance on a broad-based test but 
seeks to interpret the resulting scores as evidence of what students should 
know, not just of how well they should do on the test as a whole. 

3. This general criticism summarizes specific comments on later pages. We 
respond to these comments in items 7 through 31. 

4. NAGB's comment that our report focuses primarily on the early phase of 
the 1990 standard-setting is in error. Both the summary description of 
NAGB's approach in chapter 1 (see table 1.1) and the analysis in chapter 2 
are based on the procedures actually used in setting the 1990 standards. 
We have added a footnote early in chapter 2 to make this clear. The report 
also considers the 1992 process and credits the improvements made. We 
have added to our discussion of the 1992 procedures to incorporate new 
information provided in NAGB's comments. 

5. In this and the following paragraph, NAGB argues the importance of 
setting standards that are based on what students "ought to know and do" 
and on "setting clear markers of what students should learn" and remarks 
that we seem "cool" to this idea. We do not oppose basing the setting of 
standards on what students ought to know (what we call content-based 
performance standards). We conclude, however, that NAGB's approach is 
not suited to this purpose. NAGB itself seems to deny that it was its purpose 
to set standards of what students ought to know (see items 14 and 15): it 
says the achievement levels represent general degrees of attainment. This 
ambiguity concerning what the achievement levels represent lies at the 
heart of the problem with NAGB's approach. 

6. The remaining material on this page summarizes points NAGB presents in 
more detail on subsequent pages. We respond to these points in 
connection with their later treatment. 

7. NAGB criticizes us for using "statistical quality" as a criterion for 
evaluating a policy judgment. We continue to believe that technical 
soundness (we do not use the term "statistical quality") is pertinent. When 
policy judgments are based on data as NAGB's were, the quality of those 
data and the adequacy of the measurement procedures that underlie them 
are a legitimate concern. If the data are of poor quality or are based on a 
measurement procedure ill-suited to the purpose it is intended to serve, 
judgments based on them will be poorly informed and may lead to 
unwarranted conclusions and interpretations. 

8. NAGB is incorrect in inferring that the use of words such as "realism" is 
evidence of our preference for not expecting much from students. We take 
no position respecting what should be expected of students. Rather, we 
take the expectations stated in NAGB's levels definitions as a given. We do 
use the word "realistic" with reference to the item judgments (chapter 2). 
Our concern is that the score selected for each level be a realistic (that is, 
well-informed) estimate of how students actually at that level would be 
likely to perform. We use the word "reasonable" (in the meaning of 
sensible or valid) to apply to the interpretation given to the test scores 
selected as standards. These usages are drawn from the literature on item 
judgment methods. 

9. NAGB asserts that we fail to recognize that setting standards on any test 
involves policy decisions rather than simply a technical process that finds 
an "accurate" score. This misstates our position on two counts. First, we 
state clearly in chapter 3 that standard-setting is a matter of informed 
judgment. Second, we argue that such decisions should not rely solely on a 
technical procedure (no matter how well designed). The scores that 
emerge from that procedure should themselves be judged in the light of a 
variety of evidence to ensure that they are "accurate" in the sense that they 
represent the kind of performance the decision body has in mind. 

10. We do not assert that the use of Angoff procedures is uncommon or 
that these procedures could not be applied to NAEP. We find weaknesses in 
NAGB's particular application of those procedures and interpretation of the 
resulting test scores. 

11. The question of whether the NAGB standards denote a general level of 
proficiency or whether they also denote what students at each level know 
and are able to do is the central question of interpretation to which our 
report is addressed. We do not treat NAGB's achievement level descriptions 
as a "specific check-list of skills." 

12. We agree that standard-setting is a matter of informed judgment. We 
consider NAGB's judgment insufficiently informed because (a) judgment 
panelists lacked necessary information and (b) NAGB did not examine 
whether the scores it selected could validly be interpreted as representing 
the achievement levels it described. 

13. The question is not (as NAGB states) whether the item judgment process 
was defensible or the descriptions of what students should know were 
appropriate; it is whether those scores represent that knowledge. We 
found the preliminary evidence unconvincing, and NAGB has not offered 
evidence of its own. NAGB consulted experts with respect to the item 
judgment results and the knowledge paragraphs before the NAEP scores 
were selected, at a point when the question of interpretation could not be 
addressed. 

14. NAGB argues here that the achievement levels approach was a sound 
adaptation of a common method. We do not claim that the Angoff method 
is novel. NAGB's use of it to apply the achievement levels definitions to NAEP 
was novel. (Albert Beaton recommended a quite different variant, much 
earlier.) NAGB asserts that it did not develop the standard-setting process. 
However, the record shows that NAGB did make the decisions that turned 
out to be critical to its approach. NAGB defined the achievement levels, 
determined that panelists should make prescriptive rather than realistic 
judgments, specified that the resulting NAEP scores should be interpreted 
in terms of skills that students should have, and decided to proceed 
without seeking to validate its interpretation. NAGB has not disputed these 
facts. 

15. NAGB is correct in observing that thinking about validity has evolved 
rapidly since the professional standards we cited were written and that 
Messick presents the current view. We believe the substance of our draft 
report was consistent with that view, although we used older terminology. 
(For example, we asked whether NAGB's measurement method suited its 
purpose, which was to represent the achievement levels as NAGB defined 
them. Messick would ask whether there is a theoretical rationale for 
supposing that NAEP scores can be interpreted in terms of these 
definitions.) We have revised the material in chapter 2 to make clear that 
our focus is on validity of inference. 

16. NAGB believes that improvements in item judgment procedure for 1992 
have answered our concerns. We note in our report that item judgment 
procedures for 1992 were improved. However, while these improvements 
may lead to more reliable score estimates, they will not solve the problem 
of interpretation. In fact, NAGB's insistence that test scores be interpreted 
in terms of panelists' operational definitions of what students should be 
able to do — without any reference at all to what students at those scores 
actually can do — raises new problems. 

17. On this page, NAGB describes the achievement levels as general degrees 
of attainment and states that its intent was to set general performance 
standards on the NAEP scale, not detailed specifications of content mastery. 
This is a useful clarification of NAGB's intentions. Had NAGB presented the 
achievement levels in this way (that is, simply as its judgments of "how 
much is good enough" to be considered marginal, solid, and superior 
performance on the NAEP test), validity issues would not have arisen. 
However, NAGB sought (and by the evidence of earlier comments still 
seeks) to interpret the achievement levels in terms of what students 
should know (see item 5 above). We treat NAGB's interest in the mastery 
dimension of the achievement levels seriously because NAGB itself has 
done so. 

18. NAGB draws an analogy between the achievement levels and the levels 
associated with advanced placement exams. As illustrated by the subject 
of calculus, the advanced placement levels and the process by which they 
are set are in fact very different from NAGB's levels and procedures. Each 
advanced placement test reflects a specific course outline and assesses 
only relatively advanced skills. The five levels or grades are (1) no 
recommendation, (2) possibly qualified, (3) qualified, (4) well qualified, 
and (5) extremely well qualified. These levels represent different degrees 
of performance; no description of the skills associated with each of these 
levels is provided.1 Boundary score zones between levels are set by 
reference to score statistics from past tests and evidence of how students 
in comparable college courses perform on the test. Test papers whose 
score falls at the boundary between two levels are examined to see how 
the examinee got that score: whether by doing very well on simpler items 
or by showing skills pertinent to the higher level. 

1 There are descriptive standards for performance on specific questions (to guide the scoring of those 
questions) but not for performance on the test as a whole. 

19. NAGB comments that any conclusions about the reliability of the 1990 
item judgments are "surmises" that no one has the data to disprove. The 
design of the 1990 item judgment process did indeed preclude a full 
assessment of the reliability of the judgment results. We quote NAGB's 
finding (included in its technical report on the project) that such evidence 
as was available suggested some problems. 

20. NAGB states that we present no evidence that the procedure used to 
transform item judgment results to a NAEP score produced incorrect 
results. We agree that the procedure is accurate in this respect: if 
48 percent is the standard, it locates a score on the NAEP scale at which 
students get 48 percent correct. Our question was whether students at this 
score got this percent correct by answering questions appropriate to their 
level. We have revised our presentation to make this more clear. 
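
The distinction drawn in item 20 can be illustrated with a minimal sketch. It assumes, for illustration only, that an Angoff-style standard is the average of panelists' item ratings and that overall percent correct rises steadily with scale score. The function names, the ratings, and the score table are hypothetical; they are not NAEP data and are not presented as the contractor's actual procedure.

```python
from statistics import mean


def angoff_percent_standard(ratings_by_item):
    """Average the panelists' ratings for each item, then average across items.

    Each rating is a judged probability (0 to 1) that a student at the level
    answers the item correctly. The result is on a percent scale.
    """
    item_means = [mean(ratings) for ratings in ratings_by_item]
    return 100 * mean(item_means)


def score_at_percent(curve, target):
    """Linearly interpolate the scale score at which percent correct equals target.

    curve is a list of (scale_score, percent_correct) pairs, increasing in both.
    """
    for (s0, p0), (s1, p1) in zip(curve, curve[1:]):
        if p0 <= target <= p1:
            return s0 + (target - p0) * (s1 - s0) / (p1 - p0)
    raise ValueError("target is outside the range of the curve")


# Hypothetical ratings: three items, three panelists each.
ratings = [[0.6, 0.7, 0.5], [0.4, 0.5, 0.3], [0.5, 0.4, 0.42]]
# Hypothetical overall percent correct at selected scale scores.
curve = [(200, 30.0), (250, 45.0), (300, 65.0)]

standard = angoff_percent_standard(ratings)   # about 48 percent
cut_score = score_at_percent(curve, standard)  # about 257.5 on this made-up scale
```

A calculation of this kind finds a score with the right overall percent correct. It does not show which items students at that score answered correctly, which is the question of interpretation raised above.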

21. We are baffled by NAGB's comment that our analysis of NAEP results is 
flawed because it is based on "extrapolations rather than actual data." In 
fact, our analysis (see table 2.2) is based on data that NAGB itself presented 
to illustrate student performance at the 1990 achievement levels: data 
that NAGB describes as "the percentage of students in this group who gave 
the correct answer to the item."2 Data computed in just the same way have 
been the basis for reporting student performance at various NAEP score 
levels for years and were still being used in 1992. Of course, the NAEP 
scores themselves are extrapolations (that is, statistical estimates), as 
explained in appendix III. But as we understand it, the figures in the table 
represent the actual performance of students whose estimated NAEP scores 
fell within the specified range. 

22. Our analysis focuses on student performance on broad groups of items 
at different levels of difficulty, which is consistent with the mastery 
dimension of NAGB's achievement level definitions. (For example, we 
assume that students at the "advanced" score level should answer items in 
the most difficult group at a rate consistent with panelists' judgments.) We 
are not concerned with skill-by-skill specifications. But we are concerned 
that if NAGB purports to interpret NAEP scores in terms of item mastery (as 
in "partial mastery of fundamental skills"), the evidence should support 
this interpretation. 
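
The kind of check described in item 22 can be sketched as follows. This is an illustrative comparison under stated assumptions: the difficulty groups, the percentages, and the 5-point tolerance are all hypothetical and are not drawn from table 2.2 or from any NAEP results.

```python
def mastery_gaps(expected, observed, tolerance=5.0):
    """Return the item groups where observed percent correct falls short.

    expected and observed map an item-difficulty group to percent correct.
    A group is reported when observed is more than `tolerance` points below
    expected; the value returned is observed minus expected.
    """
    return {
        group: observed[group] - expected[group]
        for group in expected
        if expected[group] - observed[group] > tolerance
    }


# Hypothetical panelists' expectations for students at one level.
expected = {"easy": 85.0, "medium": 60.0, "hard": 35.0}
# Hypothetical observed percent correct for students at the selected score.
observed = {"easy": 88.0, "medium": 52.0, "hard": 20.0}

gaps = mastery_gaps(expected, observed)
```

In this made-up case the overall percent correct could still be close to the standard, while the medium and hard groups fall short of what panelists expected.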

23. NAGB interprets the general agreement between judgment results and 
student performance as a confirmation of its procedures. We agree that 
performance at the basic score represents mastery of fewer and less 
difficult items than at the proficient level, which in turn represents 
mastery of fewer and less difficult items than at the advanced level. Our 
question was whether students at the scores selected met the mastery 
expectations (as well as the overall score expectations) for their level. The 
differences at the basic and advanced level seemed to us to be important 
from the point of view of the interpretations given to those scores, and 
especially to "below basic" scores. 

2 National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 2, State Results 
for Released Items (Washington, D.C.: 1991), p. 38. 

24. NAGB perceives our report as being opposed to the use of descriptions 
of the kind of performance associated with a standard. Our position is that 
statements that describe expected performance at various levels are 
perfectly acceptable so long as they are demonstrably related to the actual 
performance exhibited at those levels. The problem with NAGB's 
descriptions is that this relationship has not been established. 

25. We do not assert that the NAEP scale is flawed, nor do we argue that one 
cannot find percent correct equivalent scores on the NAEP scale. (See item 
20.) We are concerned that the point thus selected may not represent the 
mastery expected for a given level; that is, that the pattern of answers exhibited 
at a score of 255, for example, is not the pattern expected for students at 
the basic level. Our report both presents the logic from which our concern 
arose and provides evidence that the two patterns are different. NAGB 
asserts that this score is "most likely" to represent the basic level 
appropriately but does not support this assertion. 

26. NAGB asserts that the conventional anchor points (which we describe as 
"straightforward" and "serviceable") raise the same issues as the 
achievement level descriptions (which we criticize). We see several 
significant differences between the two reporting methods. Anchor points 
are scores that are multiples of 50. Since the basis of selecting these scores 
does not reflect expectations of how students should perform, the issue of 
whether an anchor point represents the performance it is supposed to 
represent does not arise. Anchor point descriptions are indeed 
descriptions: they are based on observed performance at each anchor 
point score. NAGB's achievement level descriptions for 1990 combined 
prescription with description: they reflected a combination of expected 
and observed performance for each level. For 1992, the achievement levels 
descriptions state what panelists thought students should be able to do 
and do not reflect observed performance at all. To present them as if they 
describe actual performance would be misleading. Both the anchor level 
and the achievement level paragraphs are limited in that they focus on a 
selected aspect of performance (that is, they are based on a limited set of 
items within a narrow range of difficulty) rather than on performance on a 
wide range of questions. 

27. See item 8. We have deleted the words "reasonable and appropriate for 
the general student population of the United States" from chapter 3 since 
they apparently were subject to misinterpretation. 

28. We present the GED example (minimum acceptable high school 
equivalency exam score set equal to the 30th percentile score earned by 
high school graduates) to illustrate a methodology. The use of this 
example does not imply either endorsement or criticism of the particular 
percentile selected. This methodology can be used to set standards at any 
level, from minimal to very challenging. 

29. NAGB criticizes our comparison of U.S. and international achievement 
data and questions our calculations. Our presentation of international test 
data reflects published results; we made no calculations. We have deleted 
the comment characterizing the advanced standard as extreme. 

30. NAGB mischaracterizes our example of how setting standards can be 
based on information about current performance. This comment 
represents a misconception that was also evident in NAGB's May 1990 
achievement levels policy paper. Setting a percentile-based standard 
means selecting the test score earned by students at a given percentile in a 
base year, which could involve consulting a variety of evidence. For 
example, suppose NAGB concluded from various indicators that, in 1990, 
about 5 percent of American 8th graders were performing very well in 
mathematics. This would suggest that the 95th percentile score on the 
1990 NAEP test, 317, might be an appropriate standard for advanced 
performance. Evidence of the NAEP performance of students whose 
estimated scores reached this level (and perhaps at the 90th percentile 
score as well) could be inspected. If performance at the higher (but not 
the lower) score appeared appropriately advanced, the score of 317 would 
be adopted as the standard. In 1992, scores of 317 or higher would be 
counted as advanced. 
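
The arithmetic of a percentile-based standard, as described in item 30, can be sketched briefly. The sketch assumes a nearest-rank definition of a percentile, which is one of several common conventions. The score lists are hypothetical and far smaller than any real sample; they are not NAEP results.

```python
import math


def percentile_score(scores, pct):
    """Nearest-rank percentile: the smallest score such that at least
    pct percent of the scores are at or below it."""
    ordered = sorted(scores)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def share_at_or_above(scores, standard):
    """Percent of scores that meet or exceed the standard."""
    return 100 * sum(s >= standard for s in scores) / len(scores)


# Hypothetical base-year scores (20 students).
base_year = [205, 212, 220, 228, 235, 240, 244, 249, 253, 258,
             262, 266, 271, 275, 280, 286, 292, 300, 310, 325]
# Hypothetical scores from a later assessment (10 students).
later_year = [230, 245, 250, 262, 270, 288, 296, 305, 312, 330]

# Fix the standard at the base-year 95th percentile score.
standard = percentile_score(base_year, 95)
# Report the later year against that fixed score.
share = share_at_or_above(later_year, standard)
```

Because the standard is fixed as a score in the base year, the share of students meeting it in later years is free to rise or fall.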

31. NAGB states that it "is not isolated from testing experts, as GAO 
suggests." This is not an accurate description of our position. We 
recognize that NAGB made increased use of technical resources after the 
problems with the 1990 achievement levels became evident. Our 
recommendations are aimed at ensuring that NAGB draws on such 
resources as policy objectives are being formulated and delegates the 
implementation of technical procedures to appropriate experts. 

32. NAGB argues that our recommendation that the achievement levels be 
withdrawn is unwarranted. Since we do not see evidence in NAGB's 
comments that would change our conclusions about the problems of 
interpreting the achievement level scores, we see no need to change our 
recommendation. 

33. NAGB's statement that the achievement levels have "gone through" the 
1992 prepublication review that has "dealt with . . . the question of . . . 
inferences" is true but does not tell the whole story. The review took the 
achievement level scores as given and included them in reporting NAEP 
results. However, the 1992 NAEP mathematics report notes that the 
question of what can be inferred from these scores has yet to be resolved. 

34. NAGB states that the creator of the item judgment method, Dr. Angoff, 
found the achievement levels method suitable. Dr. Angoff was consulted in 
December 1989, before the specifics of the NAGB achievement levels 
approach had been developed. 

35. NAGB terms our alternative methods of standard-setting "naive and 
unsupported by research evidence." The methods of setting overall 
performance standards we discuss in chapter 3 rest on informed collective 
judgments about what constitutes adequate performance on a 
test, judgments that can appropriately be made by a broadly 
representative governing body. Such judgments can be (and from what we 
have seen, typically are) informed by a variety of evidence, from 
performance-distribution data to item judgment results to inspection of 
student NAEP test papers and other work done by the students who 
produced these papers. As indicated in item 30, the simpler of these 
methods are not as naive as NAGB thought them. 

36. NAGB again argues that we do not recognize that overall performance 
standards reflect expectations of student competence. In chapter 3, we 
make clear that judgments with respect to levels of overall performance 
reflect notions of what students should be able to do. 

37. NAGB believes that we have overlooked evidence that it sought 
technical advice regarding the 1990 achievement levels. We have added to 
the text of chapter 4 to make clear that NAGB did seek and respond to 
expert advice with respect to the item judgment procedures and their 
results after problems with its initial procedures had become evident. We 
also note that advice (including advice from NCES and ETS) was obtained 
throughout the process. Our concern is that NAGB did not respond to this 
advice, an observation that NAGB's comments do not dispute. 

38. NAGB comments that the "policy on policies" does not apply to activities 
it implemented. We have amended the text of chapter 4 to respond to this 
comment. 

39. NAGB argues that its overall performance is stronger than we suggest 
and that it has been successful in performing nontechnical functions. We 
have added a footnote to chapter 4 to make clear that our review covers 
only technical decisions and that our findings and conclusions do not 
reflect on NAGB's performance in other areas. 

40. NAGB suggests that we draw too sharp a distinction between technical 
and policy issues. In fact, we recognize that most NAEP issues have both 
technical and policy aspects. Given that NAGB is not an expert body, we 
think its responsibility with respect to matters that are primarily technical 
should be only one of policy guidance. NAGB can appropriately raise policy 
issues and propose policy objectives, but it does not have the knowledge 
resources to prescribe or implement technical solutions. 

41. NAGB argues that a strong, independent policy board is essential to 
ensure that constituency interests are represented and that it should not 
become a weak advisory committee, as we suggest. We do not suggest that 
NAGB should become a weak advisory committee; we simply suggest steps 
that could be taken to ensure that constituency interest is focused on the 
ends to be achieved through NAEP, leaving it to experts to determine 
whether and how those ends can be accomplished. 

NAEP Summary Description 



Statutory Purposes 

NAEP's statutory purposes are to assess and to describe performance in the 
basic skills of reading, mathematics, science, writing, and history or 
geography on a regular schedule and in other subject areas as NAGB may 
direct. NAEP must present achievement fairly and accurately, based on test 
results from a representative sample of students, and must report data in a 
manner that supports valid, reliable measurement of trends. 



Test Content 

Traditionally, NAEP content has been designed to focus largely on the 
common ground of American education for each subject and grade, that is, on 
the content that most students are taught. Tests have included relatively 
few items at the extremes of difficulty and relatively few that represent 
emerging practices. Beyond-grade items have been included in the 4th and 
8th grade tests, however, in order to permit NAEP to form a single 
proficiency scale that cuts across grade levels (see below). 



The skills and content to be covered by a NAEP test are identified through a
broad-based consensus process, also required by statute, which involves
teachers, curriculum specialists, local school administrators, parents, and 
concerned members of the general public. The consensus group sets the 
outline or framework for the test. The 1990 mathematics assessment 
framework called for coverage of five content areas: (1) numbers and 
operations; (2) measurement; (3) geometry; (4) data analysis, statistics, 
and probability; and (5) algebra and functions. Questions also covered 
three kinds of abilities: conceptual understanding, procedural knowledge, 
and problem solving. Content was drawn primarily from elementary and 
secondary school mathematics up to, but not including, calculus. 

The test framework identifies the number of questions needed for each 
content and ability area at each grade level. A pool of items to implement 
that framework is designed, reviewed, and pretested. A given framework is
used for several administrations of the test. Each time the test is given, 
some old items are dropped and some new ones are added. Eliminations 
and additions are made in such a way that the framework is not 
significantly altered, enabling comparisons to continue to be made over 
time. 



Student Sampling

NAEP tests a sample of students from a sample of schools. The student
sample is designed to be nationally representative. For the national 1990
mathematics assessment, the sample size was around 8,900 students per
grade tested; about 6,400 usable responses per grade resulted.

Assignment of Items
to Test Booklets



In order to reduce the testing burden for students, NAEP divides the items
for an assessment into a number of blocks and assigns the blocks to test
booklets according to a complicated statistical procedure. Each student
receives one booklet and therefore sees only some of the items on the test.
(For the 1990 8th grade mathematics test, each student saw around 59 out
of 137 items.) The assignment of blocks of items and of students to test 
booklets is designed to ensure that nationally representative data are 
obtained for every test item. However, the sample of items that a given 
student sees is not necessarily representative of the test as a whole. 



In addition to containing the test items, test booklets ask students for 
information about themselves and their parents and about the school's 
educational practices. Teachers and administrators also fill out 
questionnaires. 



Scaling of Results 



Since different students see different portions of the test, raw scores do 
not provide a comparable basis for reporting student performance on 
NAEP. NAEP uses complex statistical procedures to estimate how each
student or someone of comparable background and proficiency would 
have performed if he or she had seen all the items on the test. The 
procedure takes into account the difficulty of the questions the student 
saw and answered, not simply the number of correct answers, and is 
designed to be empirically accurate. The statistically estimated scores, not 
the students' raw scores, are the basis for the NAEP reports.



Reporting

For all subjects other than writing, NAEP reports test results in terms of a
500-point scale that cuts across grade levels. The scale measures
proficiency in the domain of elementary and secondary mathematics or
other subject tested. Scores for 4th graders fall in the lower portion of the 
scale, scores for 8th graders in the middle portion, and scores for 12th 
graders toward the high end; there is overlap between grades. 

To help readers understand the meaning of the scale in item mastery
terms, NAEP summarizes and illustrates the types of items that students are
first able to master consistently (items that reach the 65 to 80 percent
correct level) at "anchor points" located at 50-point intervals along the
scale (200, 250, 300, and so on). The item summaries are called anchor
point descriptions. The percentage of scores that fall at or above each
anchor point is also reported. However, until 1990 NAEP did not relate the
anchor points to either content or performance standards.
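
The "percentage of scores at or above each anchor point" statistic can be illustrated with a short Python sketch. It is a simplified illustration only: the function name and the sample scores are hypothetical, every score is counted equally, and the sketch does not reproduce the statistical estimation of scores described under "Scaling of Results."

```python
def percent_at_or_above(scores, anchor_points=(200, 250, 300, 350)):
    """Return {anchor point: percent of scores at or above that point}.

    Assumption: each score counts equally (no sampling weights).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    total = len(scores)
    return {
        anchor: 100.0 * sum(score >= anchor for score in scores) / total
        for anchor in anchor_points
    }


# Made-up scale scores, for illustration only.
example_scores = [188, 212, 247, 250, 263, 301, 318, 352]
example_result = percent_at_or_above(example_scores)
```

With these made-up scores, seven of the eight fall at or above 200 and only one falls at or above 350, so the reported percentages decline as the anchor point rises.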
Governance and Administrative Structure
for the National Assessment



NAEP's governance structure includes two units: the National Assessment
Governing Board and the National Center for Education Statistics. 



NAGB Membership

2 governors or former governors
2 state legislators
2 chief state school officers
1 superintendent of a local educational agency
1 member of a state board of education
1 member of a local board of education
3 classroom teachers
2 curriculum specialists
2 testing and measurement experts
1 nonpublic school administrator or policymaker
2 school principals
1 representative of business or industry
3 representatives of the general public

NAGB Responsibilities

Formulating policy guidelines for NAEP
Selecting subject areas to be assessed
Identifying appropriate achievement goals
Developing assessment objectives
Developing test specifications
Designing the methodology of the assessment
Developing guidelines and standards for analysis plans and for reporting
and disseminating results
Developing standards and procedures for interstate, regional, and national
comparisons
Taking appropriate actions to improve the form and use of NAEP

NAGB also has final authority on the appropriateness of cognitive test items
and is directed to ensure that such items are free from bias and that each
learning area assessment has goal statements devised through a
national consensus approach.



NCES Responsibilities

Carrying out NAEP with the advice of NAGB

Ensuring that NAEP provides a fair and accurate presentation of
educational achievement, uses representative sampling, reports trends
reliably, and includes information on special groups

Securing an independent evaluation of the Trial State Assessment

Ensuring the technical quality of the published data

Conducting reviews and validation studies of NAEP and soliciting
comments on its conduct and usefulness

Appendix V

NAGB Achievement Level Descriptions: 
1990 Mathematics 



Fourth Grade

"BASIC: Partial Mastery of Knowledge and Skills. Fourth-grade students who are
performing at the basic level should be able to solve routine one-step problems involving
whole numbers with and without the use of a calculator. They should also be able to use
physical materials and pictures to help them understand and explain mathematical
concepts and procedures. Students at this level are beginning to develop estimation skills
in measurement and number situations and should understand the meaning of whole
number operations. For example, students performing at the basic level should be able to
link the meaning of multiplication with the symbols needed to represent it. These students
are also beginning to develop concepts related to fractions and read simple measurement
instruments. Basic fourth-grade students should also be able to identify simple geometric
figures and extend simple patterns involving geometric figures. These students should be
able to read and use information from simple bar graphs."

"PROFICIENT: Solid Academic Performance. Fourth-grade students who are performing at
the proficient level should have an understanding of numbers and their application to
situations from students' daily lives. The proficient student should be able to solve a wide
variety of mathematical problems; use patterns and relationships to analyze mathematical
situations; relate physical materials, pictures, and diagrams to mathematical ideas; and find
and use relevant information in problem solving. Fourth-grade proficient students should
understand numbers and concepts of place value and have an understanding of whole
number operations, as well as a facility with whole number computation. For example,
students should be able to solve problems with a calculator and have the ability to use
estimation skills to solve problems. Proficient fourth-grade students should understand and
use measurement concepts such as length; be able to collect, interpret, and display data;
and use simple measurement instruments."

"ADVANCED: Superior Performance. Fourth-grade students who are performing at the
advanced level should be able to demonstrate flexibility in solving problems and relating
knowledge to new situations. They should be able to use whole numbers to analyze more
complex problems. Their understanding of fractions and decimals should extend to a
number of representations. Students at this level should determine when estimation or
calculator use is an appropriate solution to a problem, as well as read and interpret
complex graphs. Advanced fourth-grade students should also be able to use measuring
instruments in non-routine ways. These students should be able to solve simple problems
involving geometric concepts and chance."



Eighth Grade

"BASIC: Partial Mastery of Knowledge and Skills. The eighth-grade student performing at
the basic level should be able to identify and use the correct operations for solving one-
and two-step problems involving addition, subtraction, multiplication, and division of
whole numbers and decimals. These students should also have an understanding of place
value and order of operations, and a conceptual understanding of fractions. They should be
able to use a calculator and estimation to arrive at answers to simple problems. Basic
eighth-grade students can use rulers to calculate the perimeter and area of rectangular
figures, and make conversions between units of measure within a given system of
measurement. These students should be able to use basic geometric terms and identify
elementary geometric figures. They should be able to read, interpret, and construct bar
graphs and evaluate or solve simple linear equations involving whole numbers."

"PROFICIENT: Solid Academic Performance. Students at the proficient level should be
able, with and without a calculator, to solve problems requiring decimals, fractions, and
proportions. They should be able to compute with integers. They should be able to classify
geometric figures based on their properties. Proficient eighth-grade students should be able
to read, interpret, and construct line and circle graphs and show understanding of the basic
concepts of probability. These students should be able to translate verbal problem
situations into simple algebraic expressions and identify symbolic algebraic expressions
representing linear situations."

"ADVANCED: Superior Performance. Eighth-grade students performing at the advanced
level should be able to solve, with and without a calculator, a wide range of practical
problems involving percents, proportions, and exponents. These students should have a
solid conceptual understanding of the interrelationships among fractions, decimals, and
percents and their connections with proportions. Eighth-grade advanced students should
also understand and be able to use scale drawings, metric measurements, volume, and
accuracy of measurement. These students should be able to solve problems involving
elementary concepts of probability, interpret line graphs, and apply basic geometric
properties related to triangles and to perpendicular and parallel lines."



Twelfth Grade

"BASIC: Partial Mastery of Knowledge and Skills. Twelfth-grade students who are
performing at the basic level should demonstrate conceptual and procedural understanding
of whole numbers, integers, fractions, and decimals and use them when solving routine 
problems. They should understand and apply measurement concepts and skills, including 
estimation, and solve routine problems involving time, money, and length. They should also 
be able to read scale drawings and use formulas to find areas and volumes. Basic 
twelfth-grade students should be able to identify a wide range of geometric figures, 
describe their characteristics, and solve problems involving angle measurements and 
similar triangles. These students should be able to interpret data in a variety of settings, 
including charts, tables, and graphs. Their understanding of chance should include the 
ability to select favorable outcomes to a situation and find the probability of an event in a 
setting involving a small number of outcomes. They should also be able to simplify and 
evaluate simple linear expressions and solve simple one-step linear equations and 
inequalities."

"PROFICIENT: Solid Academic Performance. Twelfth-grade students who are performing
at the proficient level should have considerable command of the use of number and 
operations involving all forms of real numbers. In particular, these students should be able 
to represent problems involving integers, decimals, and fractions using symbols or graphs. 
These students should also be able to select, interpret, and use measurement relationships 
and formulas in problem situations. They should be able to make and evaluate conjectures 
about the properties of geometric figures. Proficient twelfth-grade students should be able 
to relate data about chance to physical models and use such models to solve problems. 
These students should be able to use coordinate systems on a number line to represent 
solutions to one-variable inequalities and use ordered pairs to describe locations in the 
plane."

"ADVANCED: Superior Performance. Twelfth-grade students who are performing at the
advanced level should be able to investigate numerical relationships and determine the 
validity of conjectures involving number theory concepts such as parity (odd, even) and 
divisibility. These students should be able to establish procedures for the comparison and 
conversion of measurements of length, area, volume, and capacity. These students should 
understand the Pythagorean theorem and its applications, as well as use of coordinate 
geometry to represent relationships and solve problems. These students should also be 
able to graphically describe data for a situation, as well as provide numerical measures of 
central tendency (mean, median, and mode) and variability. Advanced twelfth-grade 
students should be able to apply probability and statistics concepts in reasoning about 
population characteristics based on information derived from a sample, including judging 
the adequacy of the sample. They should also be able to determine the probability of
diverse events. These students should be able to translate information about linear
situations from verbal or tabular forms to equations and analyze, verbally or in writing, the 
nature of relationships involving change in the values of the variables involved. These 
students should also be able to solve linear equations, inequalities, and systems of two 
equations in two variables, as well as evaluate a linear function and relate the value to a 
point on a graph of the function."

Calculation of Patterns of Performance



To calculate how NAGB's panelists judged that students should perform on
items of varying difficulty (the performance pattern) as shown in table 2.2,
we listed the 137 items on the 8th grade mathematics exam in order, from
the least to the most difficult, based on judges' estimates of the percentage
of students at the basic level who should get the item correct. (The item
groupings for the proficient and advanced levels were the same as for the
basic level.) We then divided the items into four groups of equal size (34
items per quartile, except that there were 35 in the easiest quartile) and
computed the average "percent correct" score for each group of items for
basic, proficient, and advanced students according to panelists' item
judgments. The results are shown in full in figure VI.1.
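
The grouping step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the software or data actually used: the function names, the dictionary layout, and the sample numbers are hypothetical. The sketch assumes that the least difficult items are those with the highest judged percent correct at the basic level, and that any item left over after an even four-way split goes to the easiest group, which reproduces the 35/34/34/34 split for 137 items.

```python
from statistics import mean

LEVELS = ("basic", "proficient", "advanced")


def quartile_sizes(n_items):
    """Split n_items into four near-equal sizes; leftovers go to the easiest groups (assumption)."""
    base, extra = divmod(n_items, 4)
    return [base + 1 if i < extra else base for i in range(4)]


def performance_pattern(items):
    """Average judged percent correct per level for each difficulty group.

    items: list of dicts holding the judged percent correct for each level.
    Returns four dicts, ordered from the easiest to the most difficult group.
    """
    if len(items) < 4:
        raise ValueError("need at least four items to form four groups")
    # Order by the basic-level judgment only; the same grouping is reused for all levels.
    ordered = sorted(items, key=lambda item: item["basic"], reverse=True)
    pattern = []
    start = 0
    for size in quartile_sizes(len(ordered)):
        group = ordered[start:start + size]
        pattern.append({level: mean(item[level] for item in group) for level in LEVELS})
        start += size
    return pattern


# Made-up item judgments, for illustration only.
sample_items = [
    {"basic": b, "proficient": min(b + 20, 100), "advanced": min(b + 35, 100)}
    for b in (90, 80, 70, 60, 50, 40, 30, 20)
]
sample_pattern = performance_pattern(sample_items)
```

Sorting once on the basic-level judgments and reusing that order for every level mirrors the statement that the item groupings for the proficient and advanced levels were the same as for the basic level.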



Figure VI.1: Eighth Grade Performance Pattern: Item Judgments(a)

[Figure not reproduced. Vertical axis: percent correct. Results are shown for the
basic, proficient, and advanced levels, each broken out by item group: easiest
quartile of items, moderately easy quartile, moderately difficult quartile, and
most difficult quartile.]

(a) The percent correct is the proportion of students at a given level who should answer items in
each quartile correctly, according to panelists' item judgments.

Source: National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 3,
Technical Report (Washington, D.C.: 1991), pp. 265-71.



To obtain the measure of actual performance shown in table 2.2, we listed 
the 61 items that were made public from the 8th grade test in order of 
difficulty, as measured by the overall percent correct score reported for 
each item. We found the quartiles of difficulty for the full set of 8th grade 
items and assigned each of the published items to the proper item group. 
We then computed the average for each group in the same manner as for 
the item judgment data. The results are shown in figure VI.2. 
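
The assignment of released items to difficulty groups can be sketched in Python as follows. This is a hedged illustration only: the cutoff values, function names, and sample numbers are hypothetical, and the sketch simply assumes that three percent-correct boundaries separating the quartiles of the full item set are already known. Each released item is described by its overall percent correct (used for grouping) and by the percent correct among students near the standard for one level (used for averaging).

```python
from bisect import bisect_right
from statistics import mean

GROUPS = ("easiest", "moderately easy", "moderately difficult", "most difficult")


def group_averages(released_items, cutoffs):
    """Average percent correct at one level for each difficulty group.

    released_items: list of (overall_pct_correct, pct_correct_at_level) pairs.
    cutoffs: three ascending percent-correct boundaries taken from the full
             item set (hypothetical values in the example below).
    Groups with no released items map to None.
    """
    buckets = {group: [] for group in GROUPS}
    for overall_pct, pct_at_level in released_items:
        # A higher overall percent correct means an easier item, so reverse the index.
        index = 3 - bisect_right(cutoffs, overall_pct)
        buckets[GROUPS[index]].append(pct_at_level)
    return {group: (mean(values) if values else None) for group, values in buckets.items()}


# Made-up values, for illustration only.
sample_released = [(82, 75), (74, 65), (60, 50), (45, 30), (30, 12), (25, 8)]
sample_cutoffs = [40, 55, 70]
sample_averages = group_averages(sample_released, sample_cutoffs)
```

Because the boundaries come from the full item set rather than from the released items alone, the four groups of released items need not be equal in size.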



Figure VI.2: Eighth Grade Performance Pattern at NAEP Score for Each Level(a)

[Figure not reproduced. Vertical axis: percent correct. Results are shown for the
basic, proficient, and advanced levels, each broken out by item group: easiest
quartile of items, moderately easy quartile, moderately difficult quartile, and
most difficult quartile.]

(a) The percent correct is the percentage of students with a NAEP score 12.5 points below to 12.5
points above the standard for each level who answered items in each quartile correctly.

Source: National Assessment Governing Board, The Levels of Mathematics Achievement, vol. 2,
State Results for Released Items (Washington, D.C.: 1991), pp. 5-35.



Since the item judgment data did not identify each test item by number, we
do not know whether the items that fell into each group in the two sets of
data were exactly the same or not. However, we know that the item
judgments were highly correlated with actual item difficulty; that is,
judges could distinguish easier from more difficult items. Thus, we expect
that differences between the two sets of item groups are likely to be
minor.